Quasi-likelihood and/or robust estimation in high dimensions
Abstract. We consider the theory for the high-dimensional generalized linear model with the Lasso. After a short review of theoretical results in the literature, we present an extension of the oracle results to the case of quasi-likelihood loss. We prove bounds for the prediction error and $\ell_1$-error. The results are derived under fourth moment conditions on the error distribution. The case of robust loss is also given. We moreover show that under an irrepresentable condition, the $\ell_1$-penalized quasi-likelihood estimator has no false positives.
1. A REVIEW OF THE THEORY IN LITERATURE

Consider $n$ independent observations $\{(x_i, Y_i)\}_{i=1}^n$, where $Y_i \in \mathcal{Y} \subset \mathbb{R}$ is a random response variable, and $x_i$ is a fixed $p$-dimensional vector of co-variables, $i = 1, \ldots, n$. In a high-dimensional model, the number of co-variables $p$ is much larger than the number of observations $n$. There has been much literature on the linear model for this situation. In that case, one assumes that

$$Y_i = x_i^T\beta^0 + \epsilon_i, \quad i = 1, \ldots, n,$$

where $\beta^0 \in \mathbb{R}^p$ is an unknown vector of coefficients, and $\epsilon_1, \ldots, \epsilon_n$ are independent noise variables. The Lasso estimator (Tibshirani [1996]) is



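$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_{j=1}^p x_{i,j}\beta_j \Big)^2 + \lambda\|\beta\|_1 \Big\}.$$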



The parameter $\lambda > 0$ is a regularization parameter, and $\|\beta\|_1 := \sum_{j=1}^p |\beta_j|$ is the $\ell_1$-norm of $\beta$. For the case of orthogonal design, that is, the case where the columns of the $n \times p$ design matrix

$$X := (x_1, \ldots, x_n)^T$$

are orthogonal, the Lasso estimator is the soft-thresholding estimator (Donoho [1995]). We study in this paper the extension of the theoretical results for the Lasso estimator to the case of generalized linear models.
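As a small illustration of the orthogonal case, the sketch below assumes the normalization $X^TX/n = I$ and the criterion $\|Y - X\beta\|_n^2 + \lambda\|\beta\|_1$; under these assumptions each coordinate is soft-thresholded at $\lambda/2$. The function names and the toy data are illustrative only.

```python
def soft_threshold(z, thresh):
    """Shrink z towards zero by thresh; return 0 if |z| <= thresh."""
    if z > thresh:
        return z - thresh
    if z < -thresh:
        return z + thresh
    return 0.0


def lasso_orthonormal(X, Y, lam):
    """Lasso for a design with X^T X / n = I (assumed, not checked).

    The criterion (1/n) * sum_i (Y_i - x_i^T beta)^2 + lam * ||beta||_1
    decouples into p one-dimensional problems
    beta_j^2 - 2 * z_j * beta_j + lam * |beta_j|, with z_j = X_j^T Y / n.
    """
    n, p = len(X), len(X[0])
    z = [sum(X[i][j] * Y[i] for i in range(n)) / n for j in range(p)]
    return [soft_threshold(zj, lam / 2.0) for zj in z]


def objective(X, Y, beta, lam):
    """Penalized least squares criterion, normalized by 1/n."""
    n = len(X)
    rss = sum((Y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2
              for i in range(n))
    return rss / n + lam * sum(abs(b) for b in beta)


# Toy orthonormal design with n = 2, p = 2: columns sqrt(2)*e_1 and sqrt(2)*e_2.
r = 2 ** 0.5
X = [[r, 0.0], [0.0, r]]
Y = [1.0, 0.1]
beta_hat = lasso_orthonormal(X, Y, lam=0.5)
```

In this toy example the second coefficient is set exactly to zero, because $|X_2^TY/n|$ lies below the threshold $\lambda/2$.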

The theory for the Lasso with least squares loss is well established. We refer to Bunea et al. [2006], Bunea et al. [2007a], Bunea et al. [2007c], van de Geer [2007], Lounici [2008], Bickel et al. [2009]. See also Bühlmann and van de Geer [2011] and the references therein. The main results concern oracle inequalities for the prediction error $\|X(\hat\beta - \beta^0)\|_2^2$ and variable selection properties of the Lasso. Oracle results say that the prediction error of the Lasso estimator is up to log-factors as good as that of an oracle that uses the least squares "estimator" with only the co-variables in the unknown active set $S_0 := \{j : \beta_j^0 \neq 0\}$. Variable selection results roughly state that with large probability the estimated active set $\hat S := \{j : \hat\beta_j \neq 0\}$ is equal to the true active set $S_0$. Both results depend on appropriate conditions: for prediction one assumes a restricted eigenvalue condition (Koltchinskii [2009a], Koltchinskii [2009b], Bickel et al. [2009]) or compatibility conditions (van de Geer [2007]), and for variable selection, one assumes the neighborhood stability condition (Meinshausen and Bühlmann [2006]) or the equivalent irrepresentable condition (Zhao and Yu [2006]). Clearly, variable selection is a harder problem than prediction, so that one expects conditions for the former to be stronger than those for the latter. Indeed, van de Geer and Bühlmann [2009] show that the irrepresentable condition implies the compatibility condition.

Concerning work on oracle inequalities for general loss, an earlier paper which uses $\ell_1$-regularization in this context is Loubes and van de Geer [2002]. Here, the case of orthogonal design is considered (thus, it has $p \le n$). The technique of proof is however very much along the lines of the later proofs for non-orthogonal design (with possibly $p > n$), as developed by van de Geer [2007] and others. Some remarks on the proof technique can be found in van de Geer [2001], highlighting that with an $\ell_1$-penalty one can derive oracle inequalities with rates faster than $1/\sqrt{n}$, despite the fact that the penalty term $\lambda\|\beta^0\|_1$ itself is generally of larger order than $1/\sqrt{n}$. The case of quantile regression was studied in van de Geer [2003], again only for the case of orthonormal design. In Tarigan and van de Geer [2006], hinge loss with $\ell_1$-penalty is studied. Here the design is not assumed to be orthogonal, and is in fact random. This paper does not use restricted eigenvalue or compatibility conditions, but rather a weighted eigenvalue condition. It shows that the $\ell_1$-penalty leads to estimators which are adaptive both to the "smoothness" or "complexity" of the underlying regression function and to the "margin behavior" of the problem. The margin behavior expresses the amount of curvature of the theoretical risk near its minimum. The paper Bunea et al. [2007b] considers the density estimation problem. In van de Geer [2007], results are derived for generalized linear models with $\ell_1$-penalty and $p$ possibly larger than $n$, assuming the compatibility condition. It covers the case of quadratic loss and of general Lipschitz loss, and it allows for random design. Similar results are in van de Geer [2008], although there the compatibility condition is replaced by one somewhat in the spirit of the conditions in Juditsky and Nemirovski [2011]. In Bühlmann and van de Geer [2011], one can find further details concerning sparsity oracle inequalities for high-dimensional generalized linear models.

There is a large body of literature extending the oracle results for the lin- 
ear model to matrix versions. It is beyond the scope of this paper to review 
this work, and we only point to the generalization to robust loss, as given in 
Candes et al. [2009]. 

Within this volume, the paper Negahban et al. [2011] gives a general account 
of oracle results for high-dimensional M-estimators. After our Theorem 5.2, we 
briefly discuss its relation with Negahban et al. [2011]. 

Concerning variable selection, the fact that the irrepresentable condition is 
rather strong has led to considering modifications of the Lasso, such as two 
step procedures, and the SCAD introduced by Fan [1997] , see e.g. Wu and Liu 
[2009] for the case of quantile regression. 

Our paper focusses only on the theoretical aspects. There is much literature on applications of the Lasso in generalized linear models, see Wu et al. [2009] for example. The computational aspects are well-studied: see Friedman et al. [2010]. The paper Lambert-Lacroix and Zwald [2011] contains, apart from theory, also software descriptions and a real data example for the case of Huber loss. In Wang et al. [2007], $\ell_1$-regularization with least absolute deviations loss is studied and compared numerically with the least squares Lasso.

We present new results for prediction and variable selection for the case of quasi-likelihood estimation. The findings for prediction are along the lines of those in van de Geer [2008], but this time completed with the compatibility condition. The paper details and extends the findings in Bühlmann and van de Geer [2011]. We also show that a weighted form of the irrepresentable condition implies consistent variable selection.

2. QUASI-LIKELIHOOD AND ROBUST LOSS 

We model the dependence of the distribution of $Y_i$ on $x_i$ via a linear function $f_{\beta^0}(x_i) := x_i^T\beta^0$, where $\beta^0$ is a vector of unknown coefficients. The problem is to estimate $\beta^0$ or the linear predictor vector $f_{\beta^0} := X\beta^0$, where $X^T := (x_1, \ldots, x_n)$. We study a high-dimensional situation, where the number of variables $p$ can be much larger than the sample size $n$. (For technical reasons, we assume that $p$ is at least 2.) The vector $\beta^0$ is assumed to be sparse, that is, its number of non-zero coefficients is assumed to be small. See Subsection 2.2 for more details on sparsity.

We consider two models. The first one is a generalized linear model, with a given inverse link function $G$, that is

$$E(Y_i \mid x_i) := \mu_0(x_i) = G(x_i^T\beta^0), \quad i = 1, \ldots, n,$$

with $\beta^0 \in \mathbb{R}^p$ a vector of unknown coefficients. The quasi-(log)likelihood function is

$$Q(y, \mu) := \int_y^\mu \frac{y - u}{V(u)}\, du,$$

where $V : \mathbb{R} \to (0, \infty)$ is a given variance function, see also McCullagh and Nelder [1989]. Together, quasi-likelihood and link function define quasi-likelihood loss, as follows:

Definition 2.1. The quasi-likelihood loss function is

$$\rho(y, z) := -Q(y, G(z)), \quad y \in \mathcal{Y},\ z \in \mathbb{R}.$$

In our second model, the dependence of the distribution of $Y_i$ on $x_i$ may be described through quantiles or other aspects of the distribution. In particular, one can define this dependence via a loss function $\{\rho(y, z) : y \in \mathcal{Y},\ z \in \mathbb{R}\}$, and

$$f_i^0 := \arg\min_z E\rho(Y_i, z).$$

The generalized linear model assumes that $f_i^0 = x_i^T\beta^0$ for some $\beta^0 \in \mathbb{R}^p$.

The robust case is the one where, for all $y \in \mathcal{Y}$, the loss function $\rho(y, z)$ is Lipschitz in $z$, with Lipschitz constant not depending on $y$. Without loss of generality one can then assume the Lipschitz constant to be equal to one. This leads to the following definition:

Definition 2.2. The loss function $\rho$ is robust if for all $y \in \mathcal{Y}$,

$$|\rho(y, z) - \rho(y, \tilde z)| \le |z - \tilde z|, \quad \forall\ z, \tilde z.$$



Quasi-likelihood loss is sometimes robust, but there are also many examples where it is not. Moreover, there are many (robust) loss functions which do not correspond to minus quasi-likelihoods. See Section 3 for some examples.

To handle the large $p$ situation, one needs a regularized estimation method. Let us write a linear function with coefficients $\beta$ as

$$f_\beta(x) = x^T\beta.$$

In what follows, we sometimes, with some abuse of notation, let $f_\beta$ be the $n$-dimensional vector $X\beta = (f_\beta(x_1), \ldots, f_\beta(x_n))^T \in \mathbb{R}^n$ as well. The $\ell_1$-norm of a vector $\beta \in \mathbb{R}^p$ is

$$\|\beta\|_1 := \sum_{j=1}^p |\beta_j|.$$

We examine the $\ell_1$-penalized estimator $\hat\beta$ of $\beta^0$, defined as

$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{n} \sum_{i=1}^n \rho(Y_i, f_\beta(x_i)) + \lambda\|\beta\|_1 \Big\}.$$

Here, $\lambda > 0$ is a tuning parameter. Large values correspond to more regularization, which means more shrinkage of the estimator $\hat\beta$. The expression

$$\frac{1}{n} \sum_{i=1}^n \rho(Y_i, f_\beta(x_i))$$

is called the empirical risk (at $\beta$). For least squares loss (i.e., $\rho(y, u) = (y - u)^2$), the empirical risk is the usual sum of squares (normalized by $1/n$). The above estimator is then called the Lasso estimator (Tibshirani [1996]).
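The criterion that defines the estimator can be written down generically in the loss. The following is a minimal sketch of the penalized empirical risk, with the loss passed in as a function; the helper names and the toy data are assumptions made for illustration, and no minimization is attempted here.

```python
def penalized_risk(X, Y, beta, lam, loss):
    """(1/n) * sum_i loss(Y_i, x_i^T beta) + lam * ||beta||_1."""
    n = len(X)
    fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
    empirical_risk = sum(loss(y, f) for y, f in zip(Y, fitted)) / n
    return empirical_risk + lam * sum(abs(b) for b in beta)


def squared_loss(y, z):
    """Least squares loss; gives the Lasso criterion."""
    return (y - z) ** 2


def absolute_loss(y, z):
    """Least absolute deviations loss; an example of a robust loss."""
    return abs(y - z)


# Toy data with n = 3 observations and p = 2 co-variables.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, 2.0, 2.0]
beta = [1.0, 1.0]
risk_ls = penalized_risk(X, Y, beta, 0.1, squared_loss)
risk_lad = penalized_risk(X, Y, beta, 0.1, absolute_loss)
```

Passing the loss as an argument mirrors the text: only the map $z \mapsto \rho(y, z)$ changes between the least squares, quasi-likelihood and robust cases, while the penalty stays the same.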

We will study loss functions $\rho$ that are either minus quasi-likelihoods or robust (or both). The normalized Euclidean norm on $\mathbb{R}^n$ is

$$\|f\|_n := \sqrt{f^T f / n}, \quad f \in \mathbb{R}^n.$$

We will establish bounds for the "prediction error" $\|f_{\hat\beta} - f_{\beta^0}\|_n^2$, the $\ell_1$-error $\|\hat\beta - \beta^0\|_1$, and (for the case of quasi-likelihood loss) present sufficient conditions for variable selection using $\hat\beta$.

2.1 Convex loss 

We require throughout this paper, both for quasi-likelihood loss and for robust loss, that the map

$$z \mapsto \rho(y, z)$$

is convex for all $y \in \mathcal{Y}$. This assumption is important from a computational point of view. It also plays a crucial role in our theory, as it allows us to prove that the estimator $\hat\beta$ is in an $\ell_1$-neighborhood of $\beta^0$. This in turn will be invoked to establish sup-norm bounds for $f_{\hat\beta}$.

2.2 Sparsity 

The set of indices of the non-zero coefficients of $\beta^0$ is called the (true) active set. It is denoted by

$$S_0 := \{j : \beta_j^0 \neq 0\}.$$

Its cardinality $s_0 := |S_0|$ is called the sparsity index of $\beta^0$. It is assumed that $s_0$ is relatively small, at least smaller than $\sqrt{n/\log p}$ in order of magnitude (see (5.3), (6.1), (7.2), (7.3) and (8.1)). The vector $\beta^0$ is sparse if $s_0$ is small. More generally, one can call a vector $\beta$ sparse if it can in some sense be approximated by a vector with only a few non-zero entries. To avoid too many digressions, we will not elaborate on this issue, but only present a brief outline after the formulation of the main oracle result (see Remark 5.5).

2.3 Results in this paper 

As $\beta^0$ is unknown, its active set $S_0$ and its sparsity index $s_0$ are unknown as well. We will show in Theorems 5.2 and 6.1 that the prediction error of the $\ell_1$-penalized estimator $\hat\beta$ is, up to a $\log p$-term, the same as that of the minimizer of the empirical risk without penalty but with all coefficients not in $S_0$ restricted to be zero. The latter is not an estimator, as it depends on the unknown $S_0$. It is often referred to as the oracle. We moreover show that a version of the irrepresentable condition, appropriate for quasi-likelihood loss, is sufficient for variable selection (see Theorem 7.3). All our results are stated in a non-asymptotic form, but to facilitate the interpretation, we also give asymptotic formulations.

2.4 Organization of the paper 

The next section provides some examples of quasi-likelihood and robust loss. Section 4 gives the definition of the so-called compatibility constant, which will occur in the oracle results. Section 5 gives oracle inequalities for the prediction and $\ell_1$-error for quasi-likelihood loss, and Section 6 does the same for robust loss. In Section 7 we address the variable selection problem in the quasi-likelihood context. Similar arguments can be used in the robust context, but this is omitted here. Section 8 briefly discusses the case of random design, and Section 9 concludes. The proofs are in the supplemental article van de Geer and Müller [2012]. Lemmas 10.2 and 12.4 there are based on a concentration inequality (see Massart [2000]) and a contraction inequality (see Ledoux and Talagrand [1991]). These lemmas use only fourth moment assumptions, and are perhaps of interest in themselves.

3. EXAMPLES OF LOSS FUNCTIONS 

3.1 Least squares loss 

The least squares criterion has $\mathcal{Y} = \mathbb{R}$. It corresponds to a quasi-likelihood loss with variance function $V(u) = 1$ for all $u \in \mathbb{R}$. The link function is then the identity, which is the canonical link function for this case. The loss function is convex, but not robust.

3.2 Logistic loss 

When the response $Y_i$ is binary, say $Y_i \in \{0, 1\}$, $i = 1, \ldots, n$, we have

$$E(Y_i \mid x_i) = P(Y_i = 1 \mid x_i).$$

In logistic regression, one takes the quasi-likelihood with variance function $V(u) = u(1 - u)$, $u \in (0, 1)$, and the canonical link function

$$\gamma(\mu) := \log\Big(\frac{\mu}{1 - \mu}\Big), \quad \mu \in (0, 1),$$

that is

$$G(z) = \gamma^{-1}(z) = \frac{e^z}{1 + e^z}, \quad z \in \mathbb{R}.$$

Hence, in this case

$$\rho(y, z) = -yz + \log(1 + e^z), \quad z \in \mathbb{R}.$$

Because $\mathcal{Y} = \{0, 1\}$, one sees that this leads to a robust loss function, i.e., $z \mapsto \rho(y, z)$ is Lipschitz in $z$ for all $y \in \mathcal{Y}$. We acknowledge that logistic regression is not robust in the sense of having a bounded influence function (but we will in fact assume in Condition A1 that the covariables are bounded). As in all cases of quasi-likelihood with canonical link function, the loss is also convex.
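Both properties of logistic loss can be checked numerically. The sketch below takes $\rho$ as minus the quasi-likelihood of Definition 2.1, so that $\rho(y, z) = -yz + \log(1 + e^z)$, and inspects secant slopes and second differences on a grid; the grid range and the variable names are arbitrary choices for illustration.

```python
import math


def logistic_loss(y, z):
    """rho(y, z) = -y*z + log(1 + exp(z)), written to avoid overflow."""
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return -y * z + softplus


# Grid of linear predictor values z in [-8, 8] with step 0.1.
grid = [k / 10.0 for k in range(-80, 81)]

max_slope = 0.0                  # largest absolute secant slope (Lipschitz check)
min_second_diff = float("inf")   # smallest second difference (convexity check)
for y in (0, 1):
    for a, b, c in zip(grid, grid[1:], grid[2:]):
        slope = abs(logistic_loss(y, b) - logistic_loss(y, a)) / (b - a)
        max_slope = max(max_slope, slope)
        second_diff = (logistic_loss(y, a) - 2.0 * logistic_loss(y, b)
                       + logistic_loss(y, c))
        min_second_diff = min(min_second_diff, second_diff)
```

The slopes stay below one because the derivative in $z$ is $G(z) - y$, which lies in $(-1, 1)$ for $y \in \{0, 1\}$; the second differences are non-negative up to rounding, in line with convexity.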

3.3 Binary response with other link functions 

Consider binary response $Y_i \in \{0, 1\}$ as in Subsection 3.2, but now with a more general inverse link function $G$:

$$P(Y_i = 1 \mid x_i) = G(x_i^T\beta^0), \quad i = 1, \ldots, n.$$

If $G : \mathbb{R} \to [0, 1]$ is a strictly increasing symmetric distribution function, then quasi-likelihood loss is convex. This is because the hazard $g(u)/(1 - G(u))$ ($g$ being the derivative of $G$) is a decreasing function of $u$. When the hazard is uniformly bounded, quasi-likelihood loss is also robust.

3.4 Quantile regression 

If the dependence of the distribution of $Y_i \in \mathbb{R}$ on $x_i$ is via its $\alpha$-quantile ($0 < \alpha < 1$), we take as loss function

$$\rho(y, z) = \rho(y - z),$$

where

$$\rho(z) = \alpha|z|\,\mathbb{1}\{z > 0\} + (1 - \alpha)|z|\,\mathbb{1}\{z < 0\}.$$

This is clearly a robust loss function, but it does not correspond to a quasi-likelihood.
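To see why this loss targets the $\alpha$-quantile, one can minimize the empirical average of $\rho(Y_i - z)$ over $z$ for a fixed sample. The following sketch does this by brute force over a grid of candidates; the sample, the grid and the function names are illustrative assumptions.

```python
def check_loss(u, alpha):
    """rho(u) = alpha*|u| for u > 0 and (1 - alpha)*|u| for u < 0."""
    return alpha * u if u > 0 else (1.0 - alpha) * (-u)


def quantile_loss(y, z, alpha):
    """rho(y, z) = rho(y - z)."""
    return check_loss(y - z, alpha)


def empirical_risk(sample, z, alpha):
    """Average quantile loss of the constant predictor z."""
    return sum(quantile_loss(y, z, alpha) for y in sample) / len(sample)


sample = [float(k) for k in range(1, 101)]      # 1, 2, ..., 100
alpha = 0.9
candidates = [k / 2.0 for k in range(0, 221)]   # 0.0, 0.5, ..., 110.0
z_star = min(candidates, key=lambda z: empirical_risk(sample, z, alpha))
```

For this sample the empirical risk is flat between the 90th and 91st order statistics, so any minimizer found on the grid lies in that interval, which is where an empirical 0.9-quantile sits.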

4. THE COMPATIBILITY CONDITION 

Let $S \subset \{1, \ldots, p\}$ be an index set with cardinality $s$. We define for all $\beta \in \mathbb{R}^p$,

$$\beta_{S,j} := \beta_j \mathbb{1}\{j \in S\}, \quad j = 1, \ldots, p, \qquad \beta_{S^c} := \beta - \beta_S.$$

Below, we present for constants $L > 0$ the compatibility constant $\phi(L, S)$ introduced in van de Geer [2007]. For normalized design (i.e., $\|X_j\|_n = 1$ for all $j$, where $X_j$ denotes the $j$-th column of $X$), one can view $1 - \phi^2(1, S)/2$ as an $\ell_1$-version of the canonical correlation between the linear space spanned by the variables in $S$ on the one hand, and the linear space of the variables in $S^c$ on the other hand. Instead of all linear combinations with normalized $\ell_2$-norm, we now consider all linear combinations with normalized $\ell_1$-norm of the coefficients. For a geometric interpretation, we refer to van de Geer and Lederer [2012].

Definition The compatibility constant is

$$\phi^2(L, S) := \min\{ s\|f_\beta\|_n^2 : \|\beta_S\|_1 = 1,\ \|\beta_{S^c}\|_1 \le L \}.$$

The compatibility constant is closely related to (and never smaller than) the restricted eigenvalue as defined in Bickel et al. [2009], which is

$$\phi_{RE}^2(L, S) = \min\Big\{ \frac{\|f_\beta\|_n^2}{\|\beta_S\|_2^2} : \|\beta_{S^c}\|_1 \le L\|\beta_S\|_1 \Big\}.$$

The calculation of the compatibility constant is a nonlinear eigenvalue problem (see e.g. Hein and Buehler [2010] for computational aspects of nonlinear eigenvalues). Lower bounds that hold with high probability follow for example if $X$ is an i.i.d. sample from a $p$-dimensional vector with non-degenerate covariance matrix (see Section 8 for some details). See also Koltchinskii [2009a], and see van de Geer and Bühlmann [2009] for a discussion of the relation between restricted eigenvalues and compatibility.

For oracle results, we need $\phi(L, S_0)$ to be strictly positive for some $L > 1$ (depending on the tuning parameter $\lambda$). In this paper, we take $L = 3$ for definiteness, and we require throughout that $\phi(3, S_0) > 0$ (except when we consider sparse approximations of the truth, see Remark 5.5). If $\phi(3, S_0) = 0$, one sees that some conditions (e.g. condition (5.3)) become impossible.

As we will see, all bounds in this paper involve not so much the sparsity index $s_0$ itself, but rather the effective sparsity

$$\Gamma_{\rm effective}(S_0) := \frac{s_0}{\phi^2(3, S_0)}.$$



Example 4.1. As a simple numerical example, let us suppose $n = 2$, $p = 3$, $S_0 = \{3\}$, and

$$X = \sqrt{n} \begin{pmatrix} 5/13 & 0 & 1 \\ 12/13 & 1 & 0 \end{pmatrix}.$$

Thus, the sparsity index is $s_0 = 1$. One can easily verify that there is no nonzero $\beta \in \mathbb{R}^p$ with $X\beta = 0$ and $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$. Thus, the compatibility constant $\phi^2(3, S_0)$ is strictly positive. In fact, $\phi(3, S_0)$ is equal to the distance (in the $\|\cdot\|_n$ norm) of $X_3$ to the line that connects $3X_1$ and $-3X_2$, that is $\phi(3, S_0) = \sqrt{2/13}$. The effective sparsity is $\Gamma_{\rm effective}(S_0) = 13/2$.

Alternatively, when

$$X = \sqrt{n} \begin{pmatrix} 12/13 & 0 & 1 \\ 5/13 & 1 & 0 \end{pmatrix},$$

then $\phi(3, S_0) = 0$. This is due to the sharper angle between $X_1$ and $X_3$.
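The two values in Example 4.1 can be reproduced by a brute-force grid search over the constraint set. This is only a sketch: it assumes the designs $X = \sqrt{n}\,(5/13,\ 0,\ 1;\ 12/13,\ 1,\ 0)$ and $X = \sqrt{n}\,(12/13,\ 0,\ 1;\ 5/13,\ 1,\ 0)$ (rows separated by semicolons), which are the entries consistent with the stated value $\sqrt{2/13}$, and it is hard-wired to $S_0 = \{3\}$. The function name and the grid width are illustrative.

```python
import math


def compatibility_sq(M, n, L=3, grid=50):
    """Grid approximation of phi^2(L, S0) for X = sqrt(n) * M and S0 = {3}.

    By symmetry one may fix beta_3 = 1, so that ||beta_S0||_1 = 1; the other
    two coordinates range over |beta_1| + |beta_2| <= L on a grid of width
    1/grid. Since s0 = 1, phi^2 is the minimum of ||X beta||_n^2.
    """
    X = [[math.sqrt(n) * entry for entry in row] for row in M]
    m = int(L * grid)
    best = float("inf")
    for i in range(-m, m + 1):
        for j in range(-(m - abs(i)), m - abs(i) + 1):
            beta = (i / grid, j / grid, 1.0)
            norm_sq = sum(sum(x * b for x, b in zip(row, beta)) ** 2
                          for row in X) / n
            best = min(best, norm_sq)
    return best


M_first = [[5 / 13, 0.0, 1.0], [12 / 13, 1.0, 0.0]]
M_second = [[12 / 13, 0.0, 1.0], [5 / 13, 1.0, 0.0]]
phi_sq_first = compatibility_sq(M_first, n=2)
phi_sq_second = compatibility_sq(M_second, n=2)
```

For the first design the search returns approximately $2/13$, so the effective sparsity is about $13/2$; for the second it returns a value close to zero, limited only by the grid resolution.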

5. ORACLE INEQUALITIES FOR QUASI-LIKELIHOOD LOSS 
5.1 The case of least squares loss 

To appreciate the results we will present for the general case, it may be useful to first reconsider the standard linear model and least squares loss. Let $Y = (Y_1, \ldots, Y_n)^T$ and suppose

$$Y = X\beta^0 + \epsilon.$$

Let $\hat\beta$ be the Lasso estimator

$$\hat\beta = \arg\min_\beta \{ \|Y - X\beta\|_n^2 + \lambda\|\beta\|_1 \}.$$

Let $X_j$ denote the $j$-th column of the design matrix $X$. If the errors $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ are independent with mean zero and the design is normalized (that is, $\|X_j\|_n = 1$ for all $j$), one can prove that uniformly in $j$, the "correlations" $\epsilon^T X_j/n$ are small in absolute value, generally as small as $O(\sqrt{\log p/n})$. The regularization parameter $\lambda$ is to be chosen in such a way that it "overrules" these correlations. Indeed, this allows one to prove the following result (see Bühlmann and van de Geer [2011], Theorem 6.1) by rather elementary means (recall the notation $f_\beta := X\beta$):

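Theorem 5.1. Suppose that $\lambda \ge 4\max_{1 \le j \le p} |\epsilon^T X_j|/n$. Then

$$\|f_{\hat\beta} - f_{\beta^0}\|_n^2 + \lambda\|\hat\beta - \beta^0\|_1 \le 4\lambda^2\,\Gamma_{\rm effective}(S_0).$$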

This result says that if the effective sparsity $\Gamma_{\rm effective}(S_0)$ is of the same order as the sparsity index $s_0 := |S_0|$ (i.e., if the compatibility constant stays away from zero), then for a large class of error distributions the Lasso estimator with $\lambda \asymp \sqrt{\log p/n}$ is up to constants and a $(\log p)$-factor as good as the oracle least squares "estimator" which knows the active set $S_0$. The performance of $\hat\beta$ is here measured in terms of its prediction error$^1$ $\|X(\hat\beta - \beta^0)\|_n^2$. Theorem 5.1 moreover says that the $\ell_1$-error converges with rate $\lambda\Gamma_{\rm effective}(S_0)$. Looking ahead at more general loss functions, ideas are based on quadratic approximations, which are generally only valid in a neighborhood of $\beta^0$. This is why in our work, we will assume that $\lambda\Gamma_{\rm effective}(S_0)$ is small, say $\lambda\Gamma_{\rm effective}(S_0) \le \gamma$, where $\gamma$ is a sufficiently small constant. With $\lambda \asymp \sqrt{\log p/n}$, and a compatibility constant staying away from zero, it means we assume the sparsity index $s_0$ to be sufficiently smaller than $\sqrt{n/\log p}$.
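The condition of Theorem 5.1 on the regularization parameter is easy to evaluate once the noise vector is given (which, in practice, it is not; the point here is only to make the quantity concrete). The sketch below computes $4\max_j |\epsilon^T X_j|/n$ for a small hand-made design with $\|X_j\|_n = 1$; the data and function names are illustrative assumptions.

```python
import math


def lambda_lower_bound(X, eps):
    """Return 4 * max_j |eps^T X_j| / n, the threshold in Theorem 5.1."""
    n, p = len(X), len(X[0])
    corr = [abs(sum(eps[i] * X[i][j] for i in range(n))) / n for j in range(p)]
    return 4.0 * max(corr)


def universal_rate(n, p):
    """sqrt(log p / n), the order of magnitude discussed in the text."""
    return math.sqrt(math.log(p) / n)


# n = 4, p = 3; every column has entries +-1, hence ||X_j||_n = 1.
X = [[1.0, 1.0, -1.0],
     [1.0, -1.0, 1.0],
     [-1.0, 1.0, 1.0],
     [1.0, 1.0, 1.0]]
eps = [0.5, -0.5, 0.25, -0.25]
lam_min = lambda_lower_bound(X, eps)
rate = universal_rate(n=4, p=3)
```

Any $\lambda$ at least as large as `lam_min` satisfies the assumption of the theorem for this particular noise realization; the theory replaces this random threshold by a deterministic one of order `rate`.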

$^1$ The prediction error of the predictor of an independent copy $Y_{\rm new} := f_{\beta^0} + \epsilon_{\rm new}$ of $Y$ is rather $\|f_{\hat\beta} - f_{\beta^0}\|_n^2 + \sigma^2$, where $\sigma^2 = E\|\epsilon_{\rm new}\|_n^2$. We however do not include the additional variance $\sigma^2$ in our definition.
5.2 General quasi-likelihood loss

As in the situation of the standard linear model and least squares loss, we will study the error $\|f_{\hat\beta} - f_{\beta^0}\|_n^2$ and the $\ell_1$-error. For prediction, one will be interested in estimating the mean $\mu_0 = G(f_{\beta^0})$ of the response variable $Y$. Our Conditions A3 and A4 below will ensure that $G$ has a bounded derivative on an appropriate domain. This means that bounds for $\|f_{\hat\beta} - f_{\beta^0}\|_n$ immediately lead to similar bounds for $\|G(f_{\hat\beta}) - G(f_{\beta^0})\|_n$. With some abuse of terminology, we refer to $\|f_{\hat\beta} - f_{\beta^0}\|_n^2$ as the prediction error.

The theoretical properties of the $\ell_1$-penalized quasi-likelihood estimator $\hat\beta$ depend on the tail-behavior of the error

$$\epsilon_i := Y_i - \mu_0(x_i), \quad i = 1, \ldots, n.$$

We will need at least finite second moments of the errors. For definiteness, we assume the errors have finite fourth moments. With higher order moments, the confidence level in the oracle result of Theorem 5.2 will be larger, and when the errors have sub-exponential tails, one can derive exponential probability inequalities for prediction error and $\ell_1$-error.

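Condition A$_\epsilon$ There exist constants $\sigma > 0$ and $\kappa > 0$ such that

$$\max_{1 \le i \le n} E\epsilon_i^2 \le \sigma^2,$$

and

$$\frac{1}{n}\sum_{i=1}^n E\epsilon_i^4 \le \kappa^4.$$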



The next conditions, Conditions A1-A4, allow us to use quadratic approximations in a neighborhood of $\beta^0$. We assume throughout that the inverse link function $G$ is increasing and that its derivative

$$g(z) := \frac{dG(z)}{dz}, \quad z \in \mathbb{R},$$

exists. We further define

(5.1) $$\gamma(\mu) := \int_{y_0}^{\mu} \frac{1}{V(u)}\, du, \qquad B(\mu, \mu_0) := \int_{\mu_0}^{\mu} \frac{u - \mu_0}{V(u)}\, du, \quad \mu \in \mathcal{Y},$$

where $y_0$ is an arbitrary but fixed constant. We let

(5.2) $$H(z) := \gamma(G(z)), \quad z \in \mathbb{R},$$

that is, $H := \gamma \circ G$. Note that $\gamma$ is (up to an additive constant) the canonical link function. When $G = \gamma^{-1}$, we get $H(z) = z$ for all $z$. The term $yH(z)$ in the quasi-likelihood $Q(y, G(z))$ containing the response $y$ is then linear in $z$. In a sense, $H$ measures the departure from linearity of this term. We let

$$h(z) := \frac{dH(z)}{dz} = \frac{g(z)}{V(G(z))}.$$

The quantity $B(\mu, \mu_0)$ is the "regret" for choosing the expectation $\mu$ instead of the "true" $\mu_0$.
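For the logistic case these quantities can be evaluated by numerical quadrature. The sketch below assumes the Bernoulli variance function $V(u) = u(1 - u)$, the logit link, and the choice $y_0 = 1/2$ (so that the additive constant in $\gamma$ vanishes); it reads the regret as $B(\mu, \mu_0) = \int_{\mu_0}^{\mu} (u - \mu_0)/V(u)\, du$. The midpoint rule and all function names are illustrative choices.

```python
import math


def integrate(func, a, b, steps=20000):
    """Composite midpoint rule for the signed integral of func from a to b."""
    width = (b - a) / steps
    return width * sum(func(a + (k + 0.5) * width) for k in range(steps))


def V(u):
    """Bernoulli variance function."""
    return u * (1.0 - u)


def G(z):
    """Inverse of the canonical (logit) link."""
    return 1.0 / (1.0 + math.exp(-z))


def gamma(mu, y0=0.5):
    """gamma(mu) = integral from y0 to mu of du / V(u)."""
    return integrate(lambda u: 1.0 / V(u), y0, mu)


def H(z):
    """H = gamma o G; the identity when G is the inverse canonical link."""
    return gamma(G(z))


def regret(mu, mu0):
    """B(mu, mu0) = integral from mu0 to mu of (u - mu0) / V(u) du."""
    return integrate(lambda u: (u - mu0) / V(u), mu0, mu)


def bernoulli_kl(mu0, mu):
    """Kullback-Leibler divergence between Bernoulli(mu0) and Bernoulli(mu)."""
    return (mu0 * math.log(mu0 / mu)
            + (1.0 - mu0) * math.log((1.0 - mu0) / (1.0 - mu)))
```

With these choices `H(z)` returns `z` up to quadrature error, and `regret(mu, mu0)` agrees with the Kullback-Leibler divergence between the two Bernoulli distributions, which gives a familiar interpretation of the regret in this special case.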

Condition A1 There exists a constant $K_X$ such that

$$\max_{1 \le j \le p} \max_{1 \le i \le n} |x_{i,j}| \le K_X.$$

We remark that Condition A1 serves as normalization of the design, albeit not in terms of the $\|\cdot\|_n$ norm but rather in supremum norm. As our results will be presented in non-asymptotic form, it is in principle possible to see the effect when, say, $K_X$ grows with $p$ and/or $n$.

Condition A2 There exists a constant $K_0$ such that

$$\max_{1 \le i \le n} |f_{\beta^0}(x_i)| \le K_0.$$

Condition A3 With $K_X$ and $K_0$ given in Conditions A1 and A2 respectively, there exists a positive constant $C_h$ such that for all $|z| \le K_X + K_0$,

$$1/C_h \le h(z) \le C_h.$$

Condition A4 With $K_X$ and $K_0$ given in Conditions A1 and A2 respectively, there exists a constant $C_V$ such that for all $|z| \le K_X + K_0$,

$$2/C_V \le V \circ G(z) \le C_V/2.$$

Remark 5.1. There is an interplay between Conditions A$_\epsilon$, A1 and A2. For example, for quadratic loss, we do not need A1 and A2 when the errors are (sub-)Gaussian. Conditions A1 and A2 are imposed so that we need Conditions A3 and A4 only in the neighborhood $|z| \le K_X + K_0$. As for Condition A3, when $G$ is the inverse of the canonical link function $\gamma$, it holds with $C_h = 1$, as $H$ is then the identity. For quadratic loss and logistic loss, for example (which have canonical link function), Condition A4 holds as well. We actually will only need the lower bound for $V \circ G$ in this section, and the upper bound will come into play in Section 7.

To organize the constants appearing in our results, let us use the shorthand notation

$$C_{h,V} := C_V C_h, \qquad C_{h,X} := 16\, C_h K_X, \qquad \Gamma(S_0) := 16\, C_{h,V}\, \Gamma_{\rm effective}(S_0).$$

Thus, up to constants $\Gamma(S_0)$ is the effective sparsity. As in the case of least squares loss, we assume the regularization parameter $\lambda$ to be of order at least $\sqrt{\log p/n}$. The larger $\lambda$, the larger the confidence level of our bounds will be (in Theorem 5.1 this is the probability of $4\max_{1 \le j \le p} |\epsilon^T X_j|/n \le \lambda$) but then these bounds themselves are also larger. We introduce a variable $t > 0$ to describe this effect, and define

If we choose the tuning parameter $\lambda$ at least as large as $4\lambda_\epsilon(t)$, the confidence level will be at least $1 - \alpha(t)$, where

$$\alpha(t) := 3\exp[-t] + 3\kappa^4/(n\sigma^4).$$

The variable $t$ is in principle arbitrary, but it is not allowed to be arbitrarily large. As we can only apply the quadratic approximations in a neighborhood of $\beta^0$, we will need to show that $\hat\beta$ is with large probability in such a neighborhood. For that reason, we cannot let the tuning parameter $\lambda$ be arbitrarily large (as a large $\lambda$ will give slow rates): see condition (5.4) in Theorem 5.2 below. A reasonable choice for $t$ is for example $t \asymp \log n$, in which case $\alpha(t) \asymp 1/n$.
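The effect of the choice $t \asymp \log n$ can be made concrete with a few lines of code. The sketch assumes that the confidence deficit reads $\alpha(t) = 3\exp[-t] + 3\kappa^4/(n\sigma^4)$; the values of $\sigma$ and $\kappa$ below are arbitrary illustrative numbers.

```python
import math


def alpha(t, n, sigma, kappa):
    """alpha(t) = 3*exp(-t) + 3*kappa^4 / (n*sigma^4)."""
    return 3.0 * math.exp(-t) + 3.0 * kappa ** 4 / (n * sigma ** 4)


n = 1000
level = alpha(math.log(n), n, sigma=1.0, kappa=1.5)

# With t = log n, n * alpha(t) does not depend on n, i.e. alpha(t) is of order 1/n.
scaled = [m * alpha(math.log(m), m, sigma=1.0, kappa=1.5) for m in (100, 1000, 10000)]
```

Both terms of $\alpha(t)$ then decay like $1/n$, so increasing $t$ beyond $\log n$ no longer improves the order of the confidence level while it still enlarges $\lambda_\epsilon(t)$.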

Theorem 5.2. Let $\hat\beta$ be the $\ell_1$-penalized quasi-likelihood estimator. Assume Conditions A$_\epsilon$ and A1-A4. Suppose that

(5.3) Xe(t)T(S ) < \. 

Take 

(5-4) 4A e (i) < A < -i-. 

1 (A)) 

With probability at least $1 - \alpha(t)$, it holds that
and 

3 

- f/3°\\l < ^C h yX 2 T(S ). 

Remark 5.2. Our result in Theorem 5.2 is comparable to Corollary 3 in Negahban et al. [2011], although we do not assume bounded responses or a canonical link function, and our compatibility condition is weaker than the restricted eigenvalue condition assumed there. On the other hand, we require (5.3), and only give bounds for the $\ell_1$-error and prediction error, not for the $\ell_2$-error.

Remark 5.3. We have presented the result in a non-asymptotic form, but did not try to optimize the constants.
Remark 5.4. Thus, up to the compatibility constant, and taking $\lambda$ of order $\sqrt{\log p/n}$, the prediction error is of order $s_0\log p/n$:

$$\|f_{\hat\beta} - f_{\beta^0}\|_n^2 = O\Big(\frac{s_0\log p}{n}\Big).$$

An oracle that knows $S_0$ and does empirical risk minimization without penalty but with the restriction that all coefficients not in $S_0$ are set to zero, has a prediction error of order $s_0/n$. We see that for not knowing $S_0$ one pays a price of order $\log p$. We moreover have

$$\|\hat\beta - \beta^0\|_1 = O\Big(s_0\sqrt{\frac{\log p}{n}}\Big).$$



Remark 5.5. We have presented the above oracle inequality involving the sparsity of the true $\beta^0$. If the truth is not sparse, or if actually the generalized linear model is misspecified, one may replace the truth by a sparse linear approximation of the truth, and the oracle inequality involves a trade-off between the approximation error on the one hand, and the sparsity and compatibility constant on the other. This trade-off is of the following form. Let for an arbitrary index set $S \subset \{1, \ldots, p\}$,

$$f_S := \arg\min_{f = f_{\beta_S}} B(G \circ f, \mu_0),$$

where $B(G \circ f, \mu_0)$ is the average regret

$$B(G \circ f, \mu_0) := \frac{1}{n}\sum_{i=1}^n B(G \circ f(x_i), \mu_0(x_i)).$$

Thus, $f_S$ is the best approximation of $f^0$ using only the variables in $S$. Then under some regularity conditions the prediction error $B(G \circ f_{\hat\beta}, \mu_0)$ is with probability $(1 - \alpha)$ bounded by

$$\text{const.} \min_S \Big\{ B(G \circ f_S, \mu_0) + \frac{\lambda^2 |S|}{\phi^2(L, S)} \Big\}.$$

The "const." depends on the constants occurring in the regularity conditions, the constant $L$ depends moreover on the choice of $\lambda$, and the confidence level $\alpha$ depends on all these. For more details on this extension, we refer to Bühlmann and van de Geer [2011] and the references therein.

Remark 5.6. Condition (5.3) assumes that the sparsity index $s_0$ is sufficiently smaller than $\sqrt{n/\log p}$, a condition we already announced in Subsection 5.1. This assumption plays its part in all our results: it will also be important for variable selection and simplifies the derivation of results for the case of random design. In the case of least squares loss, the assumption can be avoided, even in some cases with random design. It should however be noted that a large $s_0$ means a slow rate. In particular, when the sparsity is of larger order than $\sqrt{n/\log p}$, the bound for the prediction error is of larger order than $\sqrt{\log p/n}$, and this cannot be improved up to the $\log p$-term. Thus, then the bounds are actually quite large in order of magnitude. Indeed, recall that the prediction error is $\|f_{\hat\beta} - f_{\beta^0}\|_n^2$, which is the squared distance between $f_{\hat\beta}$ and $f_{\beta^0}$. Assumption (5.3) allows us to conclude that $\|\hat\beta - \beta^0\|_1 \le 1$, and hence that $|f_{\hat\beta}(x_i)| \le K_X + K_0$ for all $i$. The latter was used because we only want to require Conditions A3 and A4 for bounded values of the argument $z$. When dealing with least squares loss, Conditions A3 and A4 hold for all $z \in \mathbb{R}$. This means that with least squares loss, Assumption (5.3) can be dropped in Theorem 5.2 (see Theorem 5.1).

Remark 5.7. The lower bound in (5.4) for the tuning parameter $\lambda$ depends on the noise level $\sigma$ as well as other unknown constants. In practice, one may for instance apply cross-validation. The noise level $\sigma$ can also be treated as an additional parameter which can be estimated along with $\beta^0$. See Städler and van de Geer [2010] for a discussion.

6. ORACLE INEQUALITIES FOR ROBUST LOSS 

In this section, we assume throughout that $\rho$ is a robust loss, see Definition 2.2. We define for $i = 1, \ldots, n$,

$$l_i(z) = E(\rho(Y_i, z) \mid x_i), \quad z \in \mathbb{R},$$

and assume that $\ddot l_i(z) := d^2 l_i(z)/dz^2$ exists.

Condition B For $K_X$ and $K_0$ given in Conditions A1 and A2 respectively, we have for some constant $C_l$ and for all $i$,

$$\inf_{|z| \le K_X + K_0} \ddot l_i(z) \ge 2/C_l.$$

Example 6.1. The least absolute deviations loss is $\rho(y, z) := |y - z|$. Let $G_i$ be the distribution function of $Y_i$ given $x_i$ ($i = 1, \ldots, n$). Then $f_i^0$ is the median of $G_i$ and Condition B requires that $G_i$ has a strictly positive density $g_i$ on $\{|z| \le K_X + K_0\}$ for all $i$.



We now define

$$\Gamma(S_0) := 16\, C_l\, \frac{s_0}{\phi^2(3, S_0)}.$$

Fix some $t > 0$ and define

$$\lambda_\epsilon(t) := 16 K_X \sqrt{\frac{2(t + \log p)}{n}}.$$



The following theorem is a reformulation of results in van de Geer [2007], van de Geer [2007] or Bühlmann and van de Geer [2011].

Theorem 6.1. Let $\hat\beta$ be the $\ell_1$-penalized robust estimator. Assume Conditions A1, A2 and B. Suppose that
(6.1) Xe(t)T (S ) < \. 

Take 

With probability at least $1 - \alpha(t)$, where $\alpha(t) := 3\exp[-t]$, it holds that
and 

Wfp-fpoWl <^GA 2 r(5 ). 

Remark 6.1. Similar remarks can be made as for the $\ell_1$-penalized quasi-likelihood estimator. The new element in the result is that with robustness the tuning parameter $\lambda$ does not depend on some noise level $\sigma$.

7. VARIABLE SELECTION WITH QUASI-LIKELIHOOD LOSS 

Note that the bounds for the $\ell_1$-error $\|\hat\beta - \beta^0\|_1$ given in Theorems 5.2 and 6.1 can be invoked to show that, with large probability, the $\ell_1$-regularized estimator will detect most of the non-zero coefficients of $\beta^0$ which are large enough: for all $\eta > 0$,

$$\#\{\hat\beta_j \neq 0,\ |\beta_j^0| > \lambda/\eta\} \ge \#\{|\beta_j^0| > \lambda/\eta\} - \eta\|\hat\beta - \beta^0\|_1/\lambda.$$

In other words, if a large proportion of the non-zero coefficients is sufficiently far above the noise level in absolute value, then there will also be many true positives. By this argument, if all non-zero coefficients of $\beta^0$ are of larger order than $\lambda\Gamma(S_0)$, we will have $\hat S \supset S_0$, where

$$\hat S := \{j : \hat\beta_j \neq 0\}.$$
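The counting argument behind the true-positive bound is elementary: every large coefficient that is missed contributes more than $\lambda/\eta$ to the $\ell_1$-error. The sketch below evaluates both sides of the inequality for given vectors; the vectors and the function name are made up for illustration.

```python
def true_positive_bound(beta_hat, beta0, lam, eta):
    """Return (detected, lower_bound) for coefficients with |beta0_j| > lam/eta.

    detected    : number of j with beta_hat_j != 0 and |beta0_j| > lam/eta
    lower_bound : #{|beta0_j| > lam/eta} - eta * ||beta_hat - beta0||_1 / lam
    """
    level = lam / eta
    large = [j for j, b in enumerate(beta0) if abs(b) > level]
    detected = sum(1 for j in large if beta_hat[j] != 0.0)
    l1_error = sum(abs(a - b) for a, b in zip(beta_hat, beta0))
    return detected, len(large) - eta * l1_error / lam


beta0 = [2.0, -1.5, 0.3, 0.0, 0.0]
beta_hat = [1.7, -1.2, 0.0, 0.1, 0.0]
detected, bound = true_positive_bound(beta_hat, beta0, lam=0.5, eta=1.0)
```

In this example the two coefficients above the level $\lambda/\eta$ are both detected, while the small coefficient $0.3$ is missed without violating the bound.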

This section will study the false positives. We show that for the case of quasi- 
likelihood loss, an irrepresentable condition similar to Meinshausen and Biihlmann 
[2006] and Zhao and Yu [2006] implies that there are no false positives, i.e., that 
S C Sq. Such result can also be obtained for robust loss, but is omitted here. 
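The counting bound at the start of this section is deterministic, so it can be checked on any pair of vectors. The sketch below uses a crude stand-in for an estimator (a thresholded noisy copy of $\beta^0$, which is an assumption for illustration and not the Lasso); the function name is hypothetical.

```python
import random

def detection_counts(beta_hat, beta0, lam, eta):
    """Return (number of detected large coefficients, lower bound from the l1-error)."""
    threshold = lam / eta
    large = [j for j, b in enumerate(beta0) if abs(b) > threshold]
    detected = sum(1 for j in large if beta_hat[j] != 0.0)
    l1_error = sum(abs(a - b) for a, b in zip(beta_hat, beta0))
    return detected, len(large) - eta * l1_error / lam

if __name__ == "__main__":
    rng = random.Random(0)
    beta0 = [2.0, 1.5, 0.05, 0.0, 0.0]
    for _ in range(5):
        # Stand-in estimator: noisy copy of beta0 with small entries set to zero.
        noisy = [b + rng.gauss(0.0, 0.3) for b in beta0]
        beta_hat = [v if abs(v) > 0.4 else 0.0 for v in noisy]
        detected, bound = detection_counts(beta_hat, beta0, lam=0.5, eta=1.0)
        print(detected, round(bound, 3), detected >= bound)
```

The bound is informative only when the $\ell_1$-error is small relative to $\lambda/\eta$; otherwise the right-hand side is negative and the inequality is trivial.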

7.1 The case of least squares loss 

Again, as preparation, let us first consider the standard linear model and the least squares Lasso estimator

$$\hat\beta = \arg\min_\beta \left\{ \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1 \right\}.$$

Let $X(S) := (X_j)_{j \in S}$ be the design matrix consisting of the variables in $S$, and let

$$\hat\Sigma_{1,1}(S) := X^T(S)X(S)/n, \qquad \hat\Sigma_{2,1}(S) := X^T(S^c)X(S)/n.$$

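In Bühlmann and van de Geer [2011] (Exercise 7.5) or van de Geer et al. [2011], one can find the following result.

Theorem 7.1. Suppose that $\lambda > \lambda_0$ where $\lambda_0 \ge 2\max_{1 \le j \le p}|\epsilon^T X_j|/n$. Assume moreover the irrepresentable condition

$$\sup_{\|\tau_{S_0}\|_\infty \le 1} \|\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)\tau_{S_0}\|_\infty < \frac{\lambda - \lambda_0}{\lambda + \lambda_0}.$$

Then $\hat S \subset S_0$.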
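The supremum in the irrepresentable condition of Theorem 7.1 equals the largest row-wise $\ell_1$-norm of the matrix $\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)$, which makes it easy to evaluate for a given design. The following is a minimal sketch with hypothetical function names, using plain lists and a small Gauss-Jordan solver rather than any particular linear algebra library.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def gram_block(x, rows_idx, cols_idx):
    """Block of X^T X / n with the given row and column variable indices."""
    n = len(x)
    return [[sum(r[j] * r[k] for r in x) / n for k in cols_idx] for j in rows_idx]

def irrepresentable_constant(x, active):
    """sup over ||tau||_inf <= 1 of ||S21 S11^{-1} tau||_inf (max row l1-norm)."""
    p = len(x[0])
    inactive = [j for j in range(p) if j not in active]
    s11 = gram_block(x, active, active)
    s21 = gram_block(x, inactive, active)
    # Row r of S21 S11^{-1} solves S11 a = S21[r], because S11 is symmetric.
    rows = [solve(s11, row) for row in s21]
    return max((sum(abs(v) for v in row) for row in rows), default=0.0)

def no_false_positive_condition(theta, lam, lam0):
    """Check theta < (lam - lam0) / (lam + lam0), as required in Theorem 7.1."""
    return theta < (lam - lam0) / (lam + lam0)

if __name__ == "__main__":
    x = [[1.0, 1.0], [1.0, 0.5]]  # rows are observations, columns are variables
    theta = irrepresentable_constant(x, active=[0])
    print(theta, no_false_positive_condition(theta, lam=1.0, lam0=0.1))
```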

We remark that an irrepresentable condition (see also below in Definition 7.1) is always rather strong. However, for exact variable selection, an irrepresentable condition is essentially necessary, as shown in Meinshausen and Bühlmann [2006], Zhao and Yu [2006], Bühlmann and van de Geer [2011]. By thresholding the estimated coefficients and refitting, or by applying the adaptive Lasso, one can often improve on variable selection and yet maintain a good prediction and estimation error. The conditions for the latter are much less restrictive than the irrepresentable condition. We refer to van de Geer et al. [2011] for details.

7.2 General quasi-likelihood loss 

The results are based on the Karush-Kuhn-Tucker (or KKT-) conditions, see Bertsimas and Tsitsiklis [1997]. In our context, they read as follows:



KKT conditions We have 

n 

= -Xf. 



' i=i 



Here $\|\hat\tau\|_\infty \le 1$, and moreover

$$\hat\tau_j 1\{\hat\beta_j \ne 0\} = \mathrm{sign}(\hat\beta_j), \quad j = 1,\ldots,p.$$



Let

$$\hat\Sigma := \frac{1}{n}\sum_{i=1}^n w_i^2 x_i x_i^T,$$

where

$$w_i^2 := h^2(x_i^T\beta^0)\, V \circ G(x_i^T\beta^0), \quad i = 1,\ldots,n.$$

Thus, $\hat\Sigma$ is the weighted Gram matrix

$$\hat\Sigma = X^T W^2 X/n, \qquad W^2 := \mathrm{diag}(w_1^2,\ldots,w_n^2).$$

We write $X_W := WX$, so that $\hat\Sigma = X_W^T X_W/n$.

Let $X_W(S)$ be the weighted design matrix consisting of the variables in $S$, and

$$\hat\Sigma_{1,1}(S) := X_W^T(S)X_W(S)/n, \qquad \hat\Sigma_{2,1}(S) := X_W^T(S^c)X_W(S)/n.$$
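As an illustration of the weighted Gram matrix, here is a small sketch for logistic regression with the canonical link. The choices $G(z) = 1/(1+e^{-z})$, $V(\mu) = \mu(1-\mu)$ and $h \equiv 1$ (so that $g = h \cdot V \circ G$) are assumptions made for the example, and all function names are hypothetical.

```python
import math

def logistic_mean(z):
    """Inverse link G(z) for logistic regression (assumed example)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_variance(mu):
    """Variance function V(mu) = mu (1 - mu)."""
    return mu * (1.0 - mu)

def logistic_h(z):
    """h = g / (V o G) is identically 1 for the canonical logistic link."""
    return 1.0

def weighted_gram(x, beta0, h, variance, link_inverse):
    """Compute (1/n) sum_i w_i^2 x_i x_i^T with w_i^2 = h(z_i)^2 V(G(z_i)), z_i = x_i^T beta0."""
    n, p = len(x), len(x[0])
    sigma = [[0.0] * p for _ in range(p)]
    for row in x:
        z = sum(a * b for a, b in zip(row, beta0))
        w2 = h(z) ** 2 * variance(link_inverse(z))
        for j in range(p):
            for k in range(p):
                sigma[j][k] += w2 * row[j] * row[k] / n
    return sigma

if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    # With beta0 = 0 every weight equals V(1/2) = 1/4.
    print(weighted_gram(x, [0.0, 0.0], logistic_h, logistic_variance, logistic_mean))
```

For least squares loss all weights are constant, and the weighted Gram matrix reduces to a multiple of $X^T X/n$.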

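Definition 7.1. Let $0 < \theta < 1$ be given. We say that the $\theta$-irrepresentable condition is met for the set $S$ if

$$\max_{\|\tau_S\|_\infty \le 1} \|\hat\Sigma_{2,1}(S)\hat\Sigma_{1,1}^{-1}(S)\tau_S\|_\infty \le \theta.$$

Here is how the $\theta$-irrepresentable condition can be linked with variable selection.

Theorem 7.2. Let $0 < \lambda_0 < \lambda$. Suppose that

(7.1) $\hat\Sigma(\hat\beta - \beta^0) = -v$,

where $|v_j| \le \lambda + \lambda_0$, and $v_j\hat\beta_j \ge (\lambda - \lambda_0)|\hat\beta_j|$, $j = 1,\ldots,p$. Suppose moreover the $\theta$-irrepresentable condition is met for $S_0$, with $\theta < (\lambda - \lambda_0)/(\lambda + \lambda_0)$. Then $\hat S \subset S_0$.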

In the proof of Theorem 7.3 below, we show that the equation (7.1) in Theorem 7.2 holds for some $v$ satisfying the conditions of this theorem. This allows us then to conclude that $\hat S \subset S_0$.

As one sees in the KKT conditions, the derivative of the loss function at $\hat\beta$ occurs. We will need to compare this with the derivative at $\beta^0$. To this end we need, in addition to Conditions A3 and A4, certain Lipschitz conditions on $h$ and $g$.

Condition A5 For $K_X$ and $K_0$ given in Conditions A1 and A2 respectively, we have for all $|z_0| \le |z| \le K_X + K_0$, and some constant $L_h$,

$$|h(z) - h(z_0)| \le L_h|z - z_0|.$$

Condition A6 For $K_X$ and $K_0$ given in Conditions A1 and A2 respectively, we have for all $|z_0| \le |z| \le K_X + K_0$, and some constant $L_g$,

$$|g(z) - g(z_0)| \le L_g|z - z_0|/2.$$



Remark 7.1. Under the additional Conditions A5 and A6, one can improve the constants in Theorem 5.2. It is also clear that Conditions A5 and A6 hold for least squares and logistic loss.

With these new constants, we define

$$L_{h,V} := (L_g + L_h C_V)C_h, \qquad L_{h,X} := 16 L_h K_X^2.$$

We moreover let

$$\Gamma_\epsilon := \Gamma(S_0) := 16 C_{h,V}\Gamma_{\rm effective}(S_0),$$

and

$$\Gamma_0 := \Gamma_0(S_0) := 6 L_{h,V} C_{h,V}^2 \Gamma_{\rm effective}(S_0).$$

Fix some $t > 0$ and define

$$\lambda_\epsilon(t) := C_{h,X}\,\sigma\sqrt{\frac{2(t + \log p)}{n}}$$

and

$$\lambda_0(t) := L_{h,X}\,\sigma\sqrt{\frac{2(t + 2\log p)}{n}}.$$

Define

$$\alpha(t) := 9\exp[-t] + 9\kappa^4/(n\sigma^4).$$

Thus, up to constants, $\Gamma_\epsilon$ and $\Gamma_0$ are the effective sparsity. Moreover, for $t \asymp \log n$ (say), $\lambda_\epsilon(t) \asymp \lambda_0(t) \asymp \sqrt{\log(p \vee n)/n}$ and $\alpha(t) \asymp 1/n$.
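To get a feel for the orders of magnitude, the quantities $\lambda_\epsilon(t)$ and $\alpha(t)$ can be evaluated directly. The sketch below uses the form $\lambda_\epsilon(t) = 16 C_h K_X \sigma\sqrt{2(t+\log p)/n}$ from Lemma 10.2; every numerical value of $C_h$, $K_X$, $\sigma$ and $\kappa$ is a hypothetical placeholder, not a value from the paper.

```python
import math

def lambda_eps(t, n, p, c_h, k_x, sigma):
    """lambda_eps(t) = 16 C_h K_X sigma sqrt(2 (t + log p) / n)."""
    return 16.0 * c_h * k_x * sigma * math.sqrt(2.0 * (t + math.log(p)) / n)

def alpha(t, n, kappa, sigma):
    """alpha(t) = 9 exp(-t) + 9 kappa^4 / (n sigma^4)."""
    return 9.0 * math.exp(-t) + 9.0 * kappa ** 4 / (n * sigma ** 4)

if __name__ == "__main__":
    p = 1000
    for n in (100, 1000, 10000):
        t = math.log(n)  # the choice t of order log n discussed in the text
        lam = lambda_eps(t, n, p, c_h=1.0, k_x=1.0, sigma=1.0)
        # n * alpha stays constant in n, illustrating alpha(t) of order 1/n.
        print(n, lam, n * alpha(t, n, kappa=1.5, sigma=1.0))
```

With these placeholder constants the factor 16 makes $\lambda_\epsilon(t)$ large for moderate $n$, which reflects that the constants in the theory are not optimized.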

We arrive at the main result of this section. 

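Theorem 7.3. Let $\hat\beta$ be the $\ell_1$-penalized quasi-likelihood estimator. Assume Conditions $A_\epsilon$ and A1-A6. Assume that (5.3) holds, i.e.,

$$\lambda_\epsilon(t)\Gamma_\epsilon \le \gamma_1 \le 1/4,$$

where $\gamma_1$ is given by

$$\gamma_1 := \lambda_\epsilon(t)/\lambda.$$

Assume now that

(7.2) $\lambda_\epsilon(t)\Gamma_0 \le \gamma_1\gamma_\epsilon$ for some $\gamma_\epsilon < 1 - \gamma_1$,

as well as

(7.3) $\lambda_0(t)\Gamma_\epsilon \le \gamma_0$ for some $\gamma_0 < 1 - \gamma_\epsilon - \gamma_1$.

Assume furthermore the $\theta$-irrepresentable condition with

$$\theta < \frac{1 - \gamma}{1 + \gamma}, \qquad \gamma := \gamma_\epsilon + \gamma_0 + \gamma_1.$$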

With probability at least $1 - \alpha(t)$, it holds that $\hat S \subset S_0$.

Remark 7.2. Let us take $\lambda_\epsilon(t) \asymp \lambda_0(t) \asymp \lambda \asymp \sqrt{\log p/n}$. The constants $\gamma_0$, $\gamma_1$ and $\gamma_\epsilon$ are small, depending on the constants appearing in Conditions $A_\epsilon$ and A1-A6. Fixing these, they can be kept away from zero, and hence also the $\theta$-irrepresentable condition is assumed for a value of $\theta$ that stays away from zero. Conditions (7.2) and (7.3) again require that the effective sparsity is sufficiently smaller than $\sqrt{n/\log p}$. Formulated differently, the results of Theorems 7.3 and 5.2 imply that if the $\theta$-irrepresentable condition holds and if $\Gamma_{\rm effective}(S_0) \le \gamma\sqrt{n/\log p}$ for sufficiently small values of $\theta$ and $\gamma$ (depending only on the constants appearing in Conditions $A_\epsilon$ and A1-A6), then with an appropriate choice of $\lambda \asymp \sqrt{\log p/n}$ the Lasso estimator has with large probability prediction error of order $\Gamma_{\rm effective}(S_0)\log p/n$, $\ell_1$-error of order $\Gamma_{\rm effective}(S_0)\sqrt{\log p/n}$ and no false positives.

8. RANDOM DESIGN 

Consider quasi-likelihood loss. It is easy to see that under the conditions of 
Theorem 5.2, one has with large probability 

- /3°) T £(/3 - /3°) < 6C^A 2 r cffcctivc (S ). 

This follows from wf < where as in Section 7, wf = h 2 {xJf3°)V o 

G(xf /3°), i = 1, . . . , n. Let X be some other p x p positive semi-definite matrix. 
Then 

where

$$\lambda_X := \max_{j,k}|\hat\Sigma_{j,k} - \Sigma_{j,k}|.$$

Thus, under the conditions of Theorem 5.2, one has that with large probability

One can verify that if $\lambda_X\Gamma(S_0)$ is small enough, say for some $\gamma_X$ sufficiently small

(8.1) $\lambda_X\Gamma(S_0) \le \gamma_X$,

then one may reformulate the compatibility condition replacing $\|f_\beta\|_n^2$ by $\beta^T\Sigma\beta$, and the theory for prediction and $\ell_1$-error goes through essentially without new arguments. One can then also establish bounds for $(\hat\beta - \beta^0)^T\Sigma(\hat\beta - \beta^0)$. Similarly, one may reformulate the $\theta$-irrepresentable condition with $\hat\Sigma$ replaced by $\Sigma$, and obtain variable selection without needing new arguments. In the case where $\Sigma$ is the population version of $\hat\Sigma$, the latter built from an i.i.d. sample of covariables, one can show that with large probability $\lambda_X$ is of order $\sqrt{\log p/n}$. In other words (and modulo the compatibility constant), condition (8.1) is another instance where it is required that the sparsity $s_0$ is not of larger order than $\sqrt{n/\log p}$. We refer to Bühlmann and van de Geer [2011] for more precise
statements.

9. CONCLUSION



The results of this paper show that the oracle and variable selection properties of the Lasso for the linear model also hold for the generalized linear model. We prove this under the assumption that the sparsity is sufficiently smaller than $\sqrt{n/\log p}$. We note that the results rely heavily on the convexity of the loss function. This allows one to work with an unbounded parameter space. If the estimators are a priori restricted to lie in a given bounded set, one can extend the results to non-convex loss (see Städler and van de Geer [2010] for the mixture model, and Schelldorfer et al. [2011] for the mixed effects model) and one can moreover prove oracle results for the almost linear in $s_0$ regime of sparsity.



10. PROOFS FOR SECTION 5 



We begin with a simple, technical result.

Lemma 10.1. We have under Condition A1,

$$\max_{1 \le i \le n}|f_\beta(x_i) - f_{\beta^0}(x_i)| \le \|\beta - \beta^0\|_1 K_X.$$

If $\|\beta - \beta^0\|_1 \le 1$, then under Conditions A1-A3,

$$|H \circ f_\beta(x_i) - H \circ f_{\beta^0}(x_i)| \le C_h|f_\beta(x_i) - f_{\beta^0}(x_i)|, \quad i = 1,\ldots,n.$$

If we assume in addition Condition A4, then

$$B(G \circ f_\beta(x_i), G \circ f_{\beta^0}(x_i)) \ge \frac{|f_\beta(x_i) - f_{\beta^0}(x_i)|^2}{C_{h,V}}, \quad i = 1,\ldots,n.$$



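Proof of Lemma 10.1. The first result follows from Hölder's inequality:

$$|f_\beta(x_i) - f_{\beta^0}(x_i)| \le \|\beta - \beta^0\|_1 \max_{1 \le j \le p}|x_{i,j}| \le \|\beta - \beta^0\|_1 K_X, \quad i = 1,\ldots,n.$$

Hence for $\|\beta - \beta^0\|_1 \le 1$,

$$|f_\beta(x_i)| \le K_X + K_0, \quad i = 1,\ldots,n.$$

The second part of the lemma follows from

$$|H(z) - H(z_0)| \le C_h|z - z_0|, \quad |z_0| \le |z| \le K_X + K_0.$$

For the third part, we use

$$\frac{d\gamma}{d\mu} = \frac{1}{V(\mu)},$$

and

$$\frac{d}{d\mu}B(\mu,\mu_0) = \frac{\mu - \mu_0}{V(\mu)}.$$

Hence,

$$\frac{d}{d\gamma}B(\mu,\mu_0) = \frac{d\mu}{d\gamma}\frac{d}{d\mu}B(\mu,\mu_0) = \mu - \mu_0.$$

Further

$$\frac{d^2}{d\gamma^2}B(\mu,\mu_0) = \frac{d}{d\gamma}(\mu - \mu_0) = V(\mu).$$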



It follows that 



B(G(z),G(z ) > -L(h(z)-H(zo)) 2 > 



□ 



The next lemma is based on Massart's concentration inequality (Massart [2000]) and a contraction inequality of Ledoux and Talagrand [1991].

Lemma 10.2. Assume Conditions $A_\epsilon$ and A1-A3. Let $t > 0$ be arbitrary and define

$$\lambda_\epsilon(t) := 16 C_h K_X \sigma\sqrt{\frac{2(t + \log p)}{n}}.$$

Let for $\beta \in \mathbb{R}^p$, $H(f_\beta)$ be the vector $H(f_\beta) := (H(f_\beta(x_1)),\ldots,H(f_\beta(x_n)))^T$, and $\epsilon := (\epsilon_1,\ldots,\epsilon_n)^T$. Then for all positive $M \le 1$, we have

$$P\left(\sup_{\|\beta - \beta^0\|_1 \le M}\left|\epsilon^T\big(H(f_\beta) - H(f_{\beta^0})\big)\right|/n > \lambda_\epsilon(t)M\right) \le 3\exp[-t] + \frac{3\kappa^4}{n\sigma^4}.$$



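Proof of Lemma 10.2. (To simplify the notation, we drop the explicit conditioning on $\{x_i\}_{i=1}^n$.) Let $\tau_1,\ldots,\tau_n$ be a Rademacher sequence (that is, $\tau_1,\ldots,\tau_n$ are i.i.d., with $P(\tau_i = 1) = P(\tau_i = -1) = 1/2$), independent of $\epsilon$. Let $E_\epsilon$ ($P_\epsilon$) denote conditional expectation (probability) given $\epsilon$. Let

$$Z := \sup_{\|\beta - \beta^0\|_1 \le M}\left|\epsilon^T\big(H(f_\beta) - H(f_{\beta^0})\big)\right|/n,$$

and let

$$Z_\tau := \sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i\big(H \circ f_\beta(x_i) - H \circ f_{\beta^0}(x_i)\big)\right|$$

be its symmetrized version. We have, using the contraction inequality (see Ledoux and Talagrand [1991]), and Lemma 10.1,

$$E_\epsilon Z_\tau \le 2C_h E_\epsilon\left(\sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i\big(f_\beta(x_i) - f_{\beta^0}(x_i)\big)\right|\right).$$

Hölder's inequality gives

$$E_\epsilon\sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i\big(f_\beta(x_i) - f_{\beta^0}(x_i)\big)\right| \le M E_\epsilon\max_{1 \le j \le p}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i x_{i,j}\right|.$$

Moreover, by the Nemirovski moment inequality (see Dümbgen et al. [2010]),

$$E_\epsilon\max_{1 \le j \le p}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i x_{i,j}\right| \le \left(\frac{2\log p}{n}\right)^{1/2}\max_{1 \le j \le p}\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2 x_{i,j}^2\right)^{1/2} \le K_X\left(\frac{2\log p}{n}\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}.$$

Thus

$$E_\epsilon Z_\tau \le 2MC_hK_X\left(\frac{2\log p}{n}\right)^{1/2}\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}.$$

Next, we use that for $\|\beta - \beta^0\|_1 \le M$, by Lemma 10.1,

$$|H \circ f_\beta(x_i) - H \circ f_{\beta^0}(x_i)| \le MC_hK_X, \quad i = 1,\ldots,n,$$

and hence

$$\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\big(H \circ f_\beta(x_i) - H \circ f_{\beta^0}(x_i)\big)^2 \le M^2C_h^2K_X^2\,\frac{1}{n}\sum_{i=1}^n \epsilon_i^2.$$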

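In view of Massart's inequality (see Massart [2000]), we now obtain

$$P_\epsilon\left(Z_\tau \ge E_\epsilon Z_\tau + 2MC_hK_X\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}\sqrt{\frac{2t}{n}}\right) \le \exp[-t],$$

and hence

$$P_\epsilon\left(Z_\tau \ge 2MC_hK_X\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}\left(\sqrt{\frac{2\log p}{n}} + \sqrt{\frac{2t}{n}}\right)\right) \le \exp[-t].$$

We now use

$$\sqrt{\frac{2\log p}{n}} + \sqrt{\frac{2t}{n}} \le 2\sqrt{\frac{\log p + t}{n}}$$

to get

$$P_\epsilon\left(Z_\tau \ge 4M\sqrt{\frac{t + \log p}{n}}\,C_hK_X\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}\right) \le \exp[-t].$$

By integrating out, it follows that

$$P\left(Z_\tau \ge 4M\sqrt{\frac{2(t + \log p)}{n}}\,C_hK_X\sigma\right) \le \exp[-t] + P\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2 > 2\sigma^2\right) \le \exp[-t] + \frac{\kappa^4}{n\sigma^4}.$$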



To de-symmetrize, we invoke that for all u > IMC^Kxa / y/n, 

$$P(Z \ge u) \le \frac{2P(Z_\tau \ge u/4)}{1 - 4C_h^2\sigma^2M^2K_X^2/(nu^2)}$$

(see Pollard [1984], or Problem 14.5 in Bühlmann and van de Geer [2011]). Apply this with

$$u = 16\sqrt{\frac{2(t + \log p)}{n}}\,MC_hK_X\sigma,$$

and use that p>2 implies 21ogp > 1. Finally apply the bound ^ < 3. 

□ 
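Two elementary computations are used in the proof of Lemma 10.2: the bound $\sqrt{2a} + \sqrt{2b} \le 2\sqrt{a+b}$, and the size of the de-symmetrization factor $2/(1 - 4C_h^2\sigma^2M^2K_X^2/(nu^2))$ at $u = 16MC_hK_X\sigma\sqrt{2(t+\log p)/n}$. At that choice of $u$ the ratio in the denominator simplifies to $1/(128(t + \log p))$. The sketch below (hypothetical function names) evaluates both.

```python
import math

def desymmetrization_factor(t, p):
    """2 / (1 - r) with r = 4 C_h^2 sigma^2 M^2 K_X^2 / (n u^2) at the chosen u.

    For u = 16 M C_h K_X sigma sqrt(2 (t + log p) / n), all constants cancel
    and r = 1 / (128 (t + log p)).
    """
    ratio = 1.0 / (128.0 * (t + math.log(p)))
    return 2.0 / (1.0 - ratio)

def sqrt_sum_gap(a, b):
    """2 sqrt(a + b) - (sqrt(2a) + sqrt(2b)); nonnegative for a, b >= 0."""
    return 2.0 * math.sqrt(a + b) - (math.sqrt(2.0 * a) + math.sqrt(2.0 * b))

if __name__ == "__main__":
    for p in (2, 10, 1000):
        # The factor stays below 3 for every p >= 2 and t > 0.
        print(p, desymmetrization_factor(0.01, p))
    print(sqrt_sum_gap(0.3, 1.7))
```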

We now prove the main result of the section. 

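Proof of Theorem 5.2. The proof is along the lines of Theorem 6.4 in Bühlmann and van de Geer [2011]. Take

$$M := \frac{8C_{h,V}\lambda^2 s_0}{(\lambda - 2\lambda_\epsilon)\phi^2(3,S_0)},$$

where $\lambda_\epsilon = \lambda_\epsilon(t)$. Throughout the proof, we assume we are on the set

$$\mathcal{T} := \left\{\sup_{\|\beta - \beta^0\|_1 \le M}\left|\epsilon^T\big(H(f_\beta) - H(f_{\beta^0})\big)\right|/n \le \lambda_\epsilon M\right\}.$$

Note that since $4\lambda_\epsilon \le \lambda \le \phi^2(3,S_0)/(16C_{h,V}s_0)$, it holds that

$$\frac{\lambda}{\lambda - 2\lambda_\epsilon} \le 2,$$

and

$$M \le 1.$$

Let

$$t := \frac{M}{M + \|\hat\beta - \beta^0\|_1}$$

and

$$\hat\beta_t := t\hat\beta + (1 - t)\beta^0.$$

Then

$$\|\hat\beta_t - \beta^0\|_1 = \frac{M\|\hat\beta - \beta^0\|_1}{M + \|\hat\beta - \beta^0\|_1} \le M.$$

So if we show that $\|\hat\beta_t - \beta^0\|_1 \le M/2$, then $\|\hat\beta - \beta^0\|_1 \le M$.

By the assumed convexity of $u \mapsto -Q(y,G(u))$, we find

$$\frac{1}{n}\sum_{i=1}^n Q(Y_i, G \circ f_{\hat\beta_t}(x_i)) - \lambda\|\hat\beta_t\|_1 \ge t\left\{\frac{1}{n}\sum_{i=1}^n Q(Y_i, G \circ f_{\hat\beta}(x_i)) - \lambda\|\hat\beta\|_1\right\} + (1 - t)\left\{\frac{1}{n}\sum_{i=1}^n Q(Y_i, G \circ f_{\beta^0}(x_i)) - \lambda\|\beta^0\|_1\right\}$$

$$\ge \frac{1}{n}\sum_{i=1}^n Q(Y_i, G \circ f_{\beta^0}(x_i)) - \lambda\|\beta^0\|_1.$$

Rewrite this as

$$\frac{1}{n}\sum_{i=1}^n B(G \circ f_{\hat\beta_t}(x_i), G \circ f_{\beta^0}(x_i)) + \lambda\|\hat\beta_t\|_1 \le \epsilon^T\big(H(f_{\hat\beta_t}) - H(f_{\beta^0})\big)/n + \lambda\|\beta^0\|_1,$$

where $B(\cdot,\cdot)$ and $H$ are given in (5.1) and (5.2) respectively. Since $\|\hat\beta_t - \beta^0\|_1 \le M \le 1$, we have (Lemma 10.1)

$$\max_{1 \le i \le n}|f_{\hat\beta_t}(x_i) - f_{\beta^0}(x_i)| \le MK_X \le K_X,$$

and hence

$$\max_{1 \le i \le n}|f_{\hat\beta_t}(x_i)| \le K_X + K_0.$$

Thus, by Lemma 10.1,

$$\frac{1}{n}\sum_{i=1}^n B(G \circ f_{\hat\beta_t}(x_i), G \circ f_{\beta^0}(x_i)) \ge \frac{\|f_{\hat\beta_t} - f_{\beta^0}\|_n^2}{C_{h,V}}.$$

Also (on the set $\mathcal{T}$),

$$\left|\epsilon^T\big(H(f_{\hat\beta_t}) - H(f_{\beta^0})\big)\right|/n \le \lambda_\epsilon M.$$

Hence on $\mathcal{T}$,

$$\|f_{\hat\beta_t} - f_{\beta^0}\|_n^2/C_{h,V} + \lambda\|\hat\beta_t\|_1 \le \lambda_\epsilon M + \lambda\|\beta^0\|_1,$$

so that

(10.1) $\|f_{\hat\beta_t} - f_{\beta^0}\|_n^2/C_{h,V} + \lambda\|(\hat\beta_t)_{S_0^c}\|_1 \le \lambda_\epsilon M + \lambda\|(\hat\beta_t)_{S_0} - \beta^0\|_1$.

Case i) If $\|\hat\beta_t - \beta^0\|_1 \ge M/2$, we find

$$\|f_{\hat\beta_t} - f_{\beta^0}\|_n^2/C_{h,V} + (\lambda - 2\lambda_\epsilon)\|(\hat\beta_t)_{S_0^c}\|_1 \le (\lambda + 2\lambda_\epsilon)\|(\hat\beta_t)_{S_0} - \beta^0\|_1.$$

It follows that

$$\|(\hat\beta_t)_{S_0^c}\|_1 \le \frac{\lambda + 2\lambda_\epsilon}{\lambda - 2\lambda_\epsilon}\|(\hat\beta_t)_{S_0} - \beta^0\|_1.$$

Since $\lambda \ge 4\lambda_\epsilon$, it holds that $(\lambda + 2\lambda_\epsilon)/(\lambda - 2\lambda_\epsilon) \le 3$. We apply the compatibility condition with $L = 3$. We also add a term $(\lambda - 2\lambda_\epsilon)\|(\hat\beta_t)_{S_0} - \beta^0\|_1$ to the left and right hand side of the first inequality of Case i), to find

$$\|f_{\hat\beta_t} - f_{\beta^0}\|_n^2/C_{h,V} + (\lambda - 2\lambda_\epsilon)\|\hat\beta_t - \beta^0\|_1 \le 2\lambda\|(\hat\beta_t)_{S_0} - \beta^0\|_1 \le 2\lambda\sqrt{s_0}\,\|f_{\hat\beta_t} - f_{\beta^0}\|_n/\phi(3,S_0).$$

So

$$\|f_{\hat\beta_t} - f_{\beta^0}\|_n \le 2C_{h,V}\lambda\sqrt{s_0}/\phi(3,S_0),$$

and

$$\|\hat\beta_t - \beta^0\|_1 \le \frac{4C_{h,V}\lambda^2 s_0}{(\lambda - 2\lambda_\epsilon)\phi^2(3,S_0)} = \frac{M}{2}.$$

Case ii) If $\|\hat\beta_t - \beta^0\|_1 < M/2$, we immediately have $\|\hat\beta - \beta^0\|_1 \le M$.

Hence, in both Case i) and Case ii), the conclusion is that $\|\hat\beta - \beta^0\|_1 \le M$. We can now use the same argument with $\hat\beta_t$ replaced by $\hat\beta$ to establish that in fact (on $\mathcal{T}$), $\|\hat\beta - \beta^0\|_1 \le M/2$. Next, we return to (10.1) with $\hat\beta_t$ replaced by $\hat\beta$:

$$\|f_{\hat\beta} - f_{\beta^0}\|_n^2/C_{h,V} + \lambda\|\hat\beta_{S_0^c}\|_1 \le \lambda_\epsilon M + \lambda\|\hat\beta_{S_0} - \beta^0\|_1.$$

This gives

$$\|f_{\hat\beta} - f_{\beta^0}\|_n^2/C_{h,V} \le (\lambda + 2\lambda_\epsilon)M/2 = \frac{\lambda + 2\lambda_\epsilon}{\lambda - 2\lambda_\epsilon}\cdot\frac{4C_{h,V}\lambda^2 s_0}{\phi^2(3,S_0)} \le \frac{12C_{h,V}\lambda^2 s_0}{\phi^2(3,S_0)}.$$

□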
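The constants appearing in the proof of Theorem 5.2 can be traced with a few lines of arithmetic. The sketch below assumes the choice $M = 8C_{h,V}\lambda^2 s_0/((\lambda - 2\lambda_\epsilon)\phi^2(3,S_0))$ and the requirement $\lambda \ge 4\lambda_\epsilon$ from the proof; all input values are hypothetical and the function name is not from the paper.

```python
def theorem_5_2_constants(lam, lam_eps, c_hv, s0, phi):
    """Evaluate M and the resulting l1 and prediction bounds from the proof."""
    if lam < 4.0 * lam_eps:
        raise ValueError("the proof assumes lam >= 4 * lam_eps")
    m = 8.0 * c_hv * lam ** 2 * s0 / ((lam - 2.0 * lam_eps) * phi ** 2)
    return {
        "M": m,
        "lam_ratio": lam / (lam - 2.0 * lam_eps),                        # at most 2
        "cone_constant": (lam + 2.0 * lam_eps) / (lam - 2.0 * lam_eps),  # at most 3
        "l1_bound": m / 2.0,                                             # bound on the l1-error
        "prediction_bound": c_hv * (lam + 2.0 * lam_eps) * m / 2.0,      # bound on the squared prediction error
        "prediction_cap": 12.0 * c_hv ** 2 * lam ** 2 * s0 / phi ** 2,   # final displayed bound
    }

if __name__ == "__main__":
    out = theorem_5_2_constants(lam=0.05, lam_eps=0.01, c_hv=1.5, s0=3, phi=0.8)
    for name, value in out.items():
        print(name, value)
```

The gap between `prediction_bound` and `prediction_cap` shows how much is lost by bounding $(\lambda + 2\lambda_\epsilon)/(\lambda - 2\lambda_\epsilon)$ by 3.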



11. PROOFS FOR SECTION 6 

We again start out with a simple technical lemma. 

Lemma 11.1. Under Conditions A1 and A2, we have for $\|\beta - \beta^0\|_1 \le 1$,

<i — 1 N 'A — IN ' 



i=l 



Proof of Lemma 11.1. To simplify the notation, we drop the explicit conditioning on $x_i$, $i = 1,\ldots,n$. A two-term Taylor expansion shows that for all $|z| \le K_X + K_0$,

$$E\rho(Y_i,z) - E\rho(Y_i,f_i^0) = \frac{1}{2}\ddot l_i(\tilde z)(z - f_i^0)^2$$

for some $\tilde z$ in between $z$ and $f_i^0$. The first derivative vanishes at $f_i^0$ since $f_i^0$ is a minimizer of $E\rho(Y_i,z)$. □

The proof of Theorem 6.1 now goes along the same lines as Theorem 5.2.

12. PROOFS FOR SECTION 7

We first prove the result linking the $\theta$-irrepresentable condition with variable selection. The proof is as in Theorem 7.1 (Part 1) in Bühlmann and van de Geer [2011].

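Proof of Theorem 7.2.

By the assumptions,

$$\hat\Sigma_{1,1}(S_0)(\hat\beta_{S_0} - \beta^0_{S_0}) + \hat\Sigma_{1,2}(S_0)\hat\beta_{S_0^c} = -v_{S_0},$$

$$\hat\Sigma_{2,1}(S_0)(\hat\beta_{S_0} - \beta^0_{S_0}) + \hat\Sigma_{2,2}(S_0)\hat\beta_{S_0^c} = -v_{S_0^c},$$

where $\hat\Sigma_{1,2}(S_0) := \hat\Sigma_{2,1}^T(S_0)$ and $\hat\Sigma_{2,2}(S_0) := X_W^T(S_0^c)X_W(S_0^c)/n$. It follows that

$$(\hat\beta_{S_0} - \beta^0_{S_0}) + \hat\Sigma_{1,1}^{-1}(S_0)\hat\Sigma_{1,2}(S_0)\hat\beta_{S_0^c} = -\hat\Sigma_{1,1}^{-1}(S_0)v_{S_0},$$

$$\hat\Sigma_{2,1}(S_0)(\hat\beta_{S_0} - \beta^0_{S_0}) + \hat\Sigma_{2,2}(S_0)\hat\beta_{S_0^c} = -v_{S_0^c}$$

(leaving the second equality untouched). Hence, multiplying the first equality by $-\hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)$, and the second by $-\hat\beta_{S_0^c}^T$,

$$-\hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)(\hat\beta_{S_0} - \beta^0_{S_0}) - \hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)\hat\Sigma_{1,2}(S_0)\hat\beta_{S_0^c} = \hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)v_{S_0},$$

$$-\hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)(\hat\beta_{S_0} - \beta^0_{S_0}) - \hat\beta_{S_0^c}^T\hat\Sigma_{2,2}(S_0)\hat\beta_{S_0^c} = \hat\beta_{S_0^c}^T v_{S_0^c} \ge (\lambda - \lambda_0)\|\hat\beta_{S_0^c}\|_1.$$

Subtracting the second from the first gives

$$\hat\beta_{S_0^c}^T\hat\Sigma_{2,2}(S_0)\hat\beta_{S_0^c} - \hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)\hat\Sigma_{1,2}(S_0)\hat\beta_{S_0^c} \le \hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)v_{S_0} - (\lambda - \lambda_0)\|\hat\beta_{S_0^c}\|_1.$$

But by the $\theta$-irrepresentable condition, with $\theta < (\lambda - \lambda_0)/(\lambda + \lambda_0)$, we get

$$\left|\hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)v_{S_0}\right| \le \|\hat\beta_{S_0^c}\|_1\,\|\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)v_{S_0}\|_\infty \le \|\hat\beta_{S_0^c}\|_1(\lambda + \lambda_0)\theta.$$

We conclude that if $\|\hat\beta_{S_0^c}\|_1 \ne 0$, then

$$\hat\beta_{S_0^c}^T\hat\Sigma_{2,2}(S_0)\hat\beta_{S_0^c} - \hat\beta_{S_0^c}^T\hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)\hat\Sigma_{1,2}(S_0)\hat\beta_{S_0^c} \le \big((\lambda + \lambda_0)\theta - (\lambda - \lambda_0)\big)\|\hat\beta_{S_0^c}\|_1 < 0.$$

The matrix

$$\hat\Sigma_{2,2}(S_0) - \hat\Sigma_{2,1}(S_0)\hat\Sigma_{1,1}^{-1}(S_0)\hat\Sigma_{1,2}(S_0)$$

is positive semi-definite. Hence we arrived at a contradiction. So it must hold that $\|\hat\beta_{S_0^c}\|_1 = 0$, i.e., that $\hat S \subset S_0$. □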
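The positive semi-definiteness of the Schur complement used at the end of the proof can be seen from a regression identity: its quadratic form at $v$ equals the residual sum of squares (divided by $n$) when the combination $X(S_0^c)v$ is projected on the columns in $S_0$. The sketch below checks this identity numerically for an unweighted design; the function names and the example matrix are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def schur_forms(x, active, v):
    """Return v^T (S22 - S21 S11^{-1} S12) v computed directly and as a residual sum of squares."""
    n, p = len(x), len(x[0])
    inactive = [j for j in range(p) if j not in active]
    # u = X(S^c) v, a linear combination of the inactive columns.
    u = [sum(row[j] * vj for j, vj in zip(inactive, v)) for row in x]
    s11 = [[sum(r[j] * r[k] for r in x) / n for k in active] for j in active]
    s12v = [sum(r[j] * ui for r, ui in zip(x, u)) / n for j in active]
    coef = solve(s11, s12v)  # least squares coefficients of u on X(S)
    direct = sum(ui * ui for ui in u) / n - sum(c * s for c, s in zip(coef, s12v))
    fitted = [sum(r[j] * c for j, c in zip(active, coef)) for r in x]
    rss = sum((ui - fi) ** 2 for ui, fi in zip(u, fitted)) / n
    return direct, rss

if __name__ == "__main__":
    x = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 3.0]]
    print(schur_forms(x, active=[0, 1], v=[2.0]))
```

The same identity holds with $X$ replaced by the weighted design $X_W$, which is the case needed in the proof.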

Our next step is an easy technical lemma.

Lemma 12.1. Assume Conditions A1-A6. Then for all $|z| \le |z_0| \le K_X + K_0$, we have

$$(G(z) - G(z_0))h(z) = (z - z_0)h^2(z_0)V \circ G(z_0) + R(z,z_0),$$

where

$$|R(z,z_0)| \le L_{h,V}(z - z_0)^2/2,$$

with

$$L_{h,V} := (L_g + L_hC_V)C_h.$$

Proof of Lemma 12.1. It holds that for some $\tilde z$ in between $z$ and $z_0$,

$$G(z) - G(z_0) = g(\tilde z)(z - z_0),$$

so that

$$|G(z) - G(z_0) - g(z_0)(z - z_0)| \le L_g(z - z_0)^2/2.$$

Further

$$|h(z) - h(z_0)| \le L_h|z - z_0|.$$

Also

$$g(z) = h(z)V \circ G(z) \le C_hC_V/2,$$

and

$$h(z) \le C_h.$$

□

We now study the "normal equations".

Lemma 12.2. Assume Conditions $A_\epsilon$ and A1-A6 and define

$$\Gamma_\epsilon := \Gamma(S_0) := \frac{16C_VC_h^2 s_0}{\phi^2(3,S_0)}.$$

Let $t > 0$ be arbitrary and define

$$\lambda_\epsilon(t) := 16C_hK_X\sigma\sqrt{\frac{2(t + \log p)}{n}}$$

and

$$\lambda_0(t) := 16L_hK_X^2\sigma\sqrt{\frac{2(t + 2\log p)}{n}}.$$

Suppose that

$$\lambda_\epsilon(t)\Gamma_\epsilon \le \frac{1}{4}.$$

Take

$$4\lambda_\epsilon(t) \le \lambda \le \frac{1}{\Gamma_\epsilon}.$$

Then with probability at least $1 - \alpha(t)$, where $\alpha(t) := 9\exp[-t] + 9\kappa^4/(n\sigma^4)$, we have

$$\hat\Sigma(\hat\beta - \beta^0) = -\lambda\hat\tau + u_n + R_n + r_n,$$

where $\|\hat\tau\|_\infty \le 1$ and $\hat\tau_j\hat\beta_j = |\hat\beta_j|$. Moreover,

$$\|u_n\|_\infty \le \lambda_\epsilon(t),$$

$$\|R_n\|_\infty \le 6\lambda^2L_{h,V}C_{h,V}^2K_Xs_0/\phi^2(3,S_0),$$

and

$$\|r_n\|_\infty \le 16C_{h,V}\lambda_0(t)\lambda s_0/\phi^2(3,S_0).$$

Proof of Lemma 12.2.

We have for $|z_0| \le |z| \le K_X + K_0$,

$$\frac{d}{dz}Q(y,G(z)) = (y - G(z_0))h(z) - (G(z) - G(z_0))h(z) = (y - G(z_0))h(z) - (z - z_0)h(z)h(\tilde z)V(G(\tilde z)),$$

where $\tilde z$ is between $z$ and $z_0$. Hence,

$$\frac{d}{dz}Q(y,G(z)) = (y - G(z_0))h(z) - (z - z_0)h^2(z_0)V(G(z_0)) + (z - z_0)^2k(z,\tilde z),$$

with $|k(z,\tilde z)| \le L_{h,V}/2$. By Theorem 5.2, it follows that with probability at least $1 - \alpha(t)/3$, for $j = 1,\ldots,p$,



J 1=1 



0=13 



^ n 1 n 



n 1 — ' n 

i=l i=l 



_. n -. n 



n <■ — ' n 

1=1 2=1 
^ / S 



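where $|k_{ij}| \le L_{h,V}/2$ for all $i$ and $j$. We can write this as

(12.1) $\displaystyle\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\beta}Q(Y_i,G(x_i^T\beta))\Big|_{\beta = \hat\beta} = \epsilon_V^TX_W/n - \hat\Sigma(\hat\beta - \beta^0) + R_n + r_n$,

where $(\epsilon_V)_i = \epsilon_i/\sqrt{V \circ G(x_i^T\beta^0)}$, $i = 1,\ldots,n$. By Lemma 12.3, with probability at least $1 - \alpha(t)/3$,

$$\|\epsilon_V^TX_W/n\|_\infty \le \lambda_\epsilon(t).$$

Moreover, $R_n$ is a $p$-vector satisfying

$$\max_{1 \le j \le p}|R_{n,j}| \le K_XL_{h,V}\|f_{\hat\beta} - f_{\beta^0}\|_n^2/2.$$

We can apply Theorem 5.2 to bound $\|f_{\hat\beta} - f_{\beta^0}\|_n^2$, and so we find that

$$\|R_n\|_\infty \le 6\lambda^2L_{h,V}C_{h,V}^2K_Xs_0/\phi^2(3,S_0).$$

Furthermore, we take

$$r_n := \frac{1}{n}\sum_{i=1}^n \epsilon_i\big(h(x_i^T\hat\beta) - h(x_i^T\beta^0)\big)x_i.$$

Also, by Lemma 12.4 combined with Theorem 5.2, with probability at least $1 - 2\alpha(t)/3$, equation (12.1) is true with

$$\|r_n\|_\infty \le 16C_{h,V}\lambda_0(t)\lambda s_0/\phi^2(3,S_0).$$

The result now follows from the KKT-conditions.

□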

In Lemma 12.2, we needed two results for the random terms. These are the following two lemmas.

Lemma 12.3. Assume Conditions $A_\epsilon$ and A1-A3. Let $(\epsilon_V)_i = \epsilon_i/\sqrt{V \circ G(x_i^T\beta^0)}$, $i = 1,\ldots,n$. Let $t > 0$ be arbitrary and define

$$\lambda_\epsilon(t) := 16C_hK_X\sigma\sqrt{\frac{2(t + \log p)}{n}}.$$

Then

$$P\left(\|\epsilon_V^TX_W\|_\infty/n > \lambda_\epsilon(t)\right) \le 3\exp[-t] + \frac{3\kappa^4}{n\sigma^4}.$$

Proof of Lemma 12.3. This follows from similar (and in fact simpler) arguments as used for the proof of Lemma 10.2. □

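Lemma 12.4. Assume Conditions $A_\epsilon$, A1, A2 and A5. Let $t > 0$ be arbitrary and define

$$\lambda_0(t) := 16\sqrt{\frac{2(t + 2\log p)}{n}}\,L_hK_X^2\sigma.$$

We have

$$P\left(\max_{1 \le j \le p}\sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i\big(h \circ f_\beta(x_i) - h \circ f_{\beta^0}(x_i)\big)x_{i,j}\right| > \lambda_0(t)M\right) \le 3\exp[-t] + 3\kappa^4/(n\sigma^4).$$

Proof of Lemma 12.4. Similarly to the definitions in the proof of Lemma 10.2, we define for $j = 1,\ldots,p$,

$$Z_j := \sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \epsilon_i\big(h \circ f_\beta(x_i) - h \circ f_{\beta^0}(x_i)\big)x_{i,j}\right|$$

and

$$Z_{\tau,j} := \sup_{\|\beta - \beta^0\|_1 \le M}\left|\frac{1}{n}\sum_{i=1}^n \tau_i\epsilon_i\big(h \circ f_\beta(x_i) - h \circ f_{\beta^0}(x_i)\big)x_{i,j}\right|,$$

where $(\tau_1,\ldots,\tau_n)$ is a Rademacher sequence independent of $\epsilon$. Also, by the same arguments as in the proof of Lemma 10.2, we have for $j = 1,\ldots,p$,

$$P_\epsilon\left(Z_{\tau,j} \ge 4M\sqrt{\frac{t + \log p}{n}}\,L_hK_X^2\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}\right) \le \exp[-t].$$

Thus,

$$P_\epsilon\left(\max_{1 \le j \le p}Z_{\tau,j} \ge 4M\sqrt{\frac{t + 2\log p}{n}}\,L_hK_X^2\left(\frac{1}{n}\sum_{i=1}^n \epsilon_i^2\right)^{1/2}\right) \le p\exp[-(t + \log p)] = \exp[-t].$$

The proof can now be finished in the same way as the one of Lemma 10.2. □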
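Finally, we prove the main result of the section.

Proof of Theorem 7.3.

By Lemma 12.2, it holds with probability at least $1 - \alpha(t)$, that

$$\|u_n\|_\infty \le \lambda_\epsilon(t) = \gamma_1\lambda,$$

$$\|R_n\|_\infty \le \lambda^2\Gamma_0 = \frac{\lambda^2}{\lambda_\epsilon(t)}\lambda_\epsilon(t)\Gamma_0 \le \frac{\lambda^2}{\lambda_\epsilon(t)}\gamma_1\gamma_\epsilon = \lambda\gamma_\epsilon,$$

and

$$\|r_n\|_\infty \le \lambda\lambda_0(t)\Gamma_\epsilon \le \lambda\gamma_0.$$

Hence,

$$\hat\Sigma(\hat\beta - \beta^0) = -\lambda\hat\tau + \mathrm{Rem} =: -v,$$

where Rem is a remainder term satisfying

$$\|\mathrm{Rem}\|_\infty \le (\gamma_\epsilon + \gamma_0 + \gamma_1)\lambda = \gamma\lambda,$$

so that

$$\|v\|_\infty \le \lambda + \|\mathrm{Rem}\|_\infty \le (1 + \gamma)\lambda.$$

Also, if $\hat\beta_j > 0$, then $\hat\tau_j = 1$ and hence $v_j = \lambda - \mathrm{Rem}_j \ge \lambda(1 - \gamma)$. Similarly, if $\hat\beta_j < 0$, then $\hat\tau_j = -1$ and hence $v_j = -\lambda - \mathrm{Rem}_j \le -(1 - \gamma)\lambda$. The proof is finished by applying Theorem 7.2 with $\lambda_0 = \gamma\lambda$.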



□ 
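The bookkeeping of the constants $\gamma_1$, $\gamma_\epsilon$, $\gamma_0$ in the proof of Theorem 7.3 can be summarized in a few lines. The sketch below takes the smallest values permitted by (7.2) and (7.3), assumes the requirement $\gamma_1 \le 1/4$, and returns the resulting upper limit for $\theta$; all inputs are hypothetical numbers and the function name is not from the paper.

```python
def theorem_7_3_margins(lam, lam_eps, lam_0, gamma_eps_sparsity, gamma_0_sparsity):
    """Compute gamma_1, gamma_eps, gamma_0, gamma and the limit (1 - gamma) / (1 + gamma).

    gamma_eps_sparsity plays the role of Gamma_eps, gamma_0_sparsity that of Gamma_0.
    """
    gamma_1 = lam_eps / lam                              # lambda_eps(t) = gamma_1 * lambda
    gamma_eps = lam_eps * gamma_0_sparsity / gamma_1     # smallest value allowed by (7.2)
    gamma_0 = lam_0 * gamma_eps_sparsity                 # smallest value allowed by (7.3)
    gamma = gamma_eps + gamma_0 + gamma_1
    conditions_hold = (
        lam_eps * gamma_eps_sparsity <= gamma_1 <= 0.25  # assumed form of (5.3)
        and gamma_eps < 1.0 - gamma_1
        and gamma_0 < 1.0 - gamma_eps - gamma_1
    )
    return {
        "gamma_1": gamma_1,
        "gamma_eps": gamma_eps,
        "gamma_0": gamma_0,
        "gamma": gamma,
        "theta_limit": (1.0 - gamma) / (1.0 + gamma),
        "conditions_hold": conditions_hold,
    }

if __name__ == "__main__":
    print(theorem_7_3_margins(lam=0.1, lam_eps=0.02, lam_0=0.02,
                              gamma_eps_sparsity=2.0, gamma_0_sparsity=1.0))
```

Since $\gamma_\epsilon = \lambda\Gamma_0$ and $\gamma_0 = \lambda_0(t)\Gamma_\epsilon$ in this parametrization, a larger effective sparsity shrinks the admissible range for $\theta$.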